Dedication

This  volume  is  dedicated  to  Professor  Patrick  Parks  who  died  in  1995.  Patrick  was 
famous  both  for  his  work  in  nonlinear  stability  and  automatic  control  and  for  his 
more  recent  contributions  to  neural  networks — especially  on  learning  procedures 
for CMAC systems. Patrick was known to many as a good friend and colleague,
and  as  a  gentleman,  and  he  is  greatly  missed. 


PREFACE 


This  volume  of  research  papers  comprises  the  proceedings  of  the  first  International 
Conference  on  Mathematics  of  Neural  Networks  and  Applications  (MANNA),  which 
was  held  at  Lady  Margaret  Hall,  Oxford  from  July  3rd  to  7th,  1995  and  attended  by 
116  people.  The  meeting  was  strongly  supported  and,  in  addition  to  a  stimulating 
academic programme, it featured a delightful venue, excellent food and accommodation,
a full social programme and fine weather - all of which made for a very
enjoyable  week. 

This  was  the  first  meeting  with  this  title  and  it  was  run  under  the  auspices  of  the 
Universities  of  Huddersfield  and  Brighton,  with  sponsorship  from  the  US  Air  Force 
(European Office of Aerospace Research and Development) and the London Mathematical
Society. This enabled a very interesting and wide-ranging conference programme
to be offered. We sincerely thank all these organisations, USAF-EOARD,
LMS,  and  Universities  of  Huddersfield  and  Brighton  for  their  invaluable  support. 
The  conference  organisers  were  John  Mason  (Huddersfield)  and  Steve  Ellacott 
(Brighton),  supported  by  a  programme  committee  consisting  of  Nigel  Allinson 
(UMIST),  Norman  Biggs  (London  School  of  Economics),  Chris  Bishop  (Aston), 
David Lowe (Aston), Patrick Parks (Oxford), John Taylor (King’s College, London)
and Kevin Warwick (Reading). The local organiser from Huddersfield was
Ros  Hawkins,  who  took  responsibility  for  much  of  the  administration  with  great 
efficiency  and  energy.  The  Lady  Margaret  Hall  organisation  was  led  by  their  bursar, 
Jeanette  Griffiths,  who  ensured  that  the  week  was  very  smoothly  run. 

It  was  very  sad  that  Professor  Patrick  Parks  died  shortly  before  the  conference.  He 
made  important  contributions  to  the  field  and  was  to  have  given  an  invited  talk  at 
the  meeting. 

Leading the academic programme at the meeting were nine invited speakers: Nigel
Allinson (UMIST), Shun-ichi Amari (Tokyo), Norman Biggs (LSE), George Cybenko
(Dartmouth), Federico Girosi (MIT), Stephen Grossberg (Boston), Morris
Hirsch (Berkeley), Helge Ritter (Bielefeld) and John Taylor (King’s College, London).
The supporting programme was substantial; out of about 110 who submitted
abstracts, 78 delegates were offered and accepted opportunities to contribute papers.
An abundance of relevant topics and areas was therefore covered, which was
indeed one of the primary objectives.

The  main  aim  of  the  conference  and  of  this  volume  was  to  bring  together  researchers 
and  their  work  in  the  many  areas  in  which  mathematics  impinges  on  and  contributes 
to  neural  networks,  including  a  number  of  key  applications  areas,  in  order  to  expose 
current  research  and  stimulate  new  ideas.  We  believe  that  this  aim  was  achieved. 
In  particular  the  meeting  attracted  significant  contributions  in  such  mathematical 
aspects  as  statistics  and  probability,  statistical  mechanics,  dynamics,  mathematical 
biology and neural sciences, approximation theory and numerical analysis, algebra,
geometry and combinatorics, and control theory. It also covered a considerable
range of neural network topics in such areas as learning and training, neural network
classifiers, memory based networks, self organising maps and unsupervised
learning,  Hopfield  networks,  radial  basis  function  networks,  and  the  general  area 
of  neural  network  modelling  and  theory.  Finally,  applications  of  neural  networks 
were  considered  in  such  topics  as  chemistry,  speech  recognition,  automatic  control, 
nonlinear programming, medicine, image processing, finance, time series, and dynamics.
The final collection of papers in this volume consists of 6 invited papers and
over  60  contributed  papers,  selected  from  the  papers  presented  at  the  conference 
following  a  refereeing  procedure  of  both  the  talks  and  the  final  papers.  We  seriously 
considered  dividing  the  material  into  subject  areas,  but  in  the  end  decided  that  this 
would  be  arbitrary  and  difficult  -  since  so  many  papers  addressed  more  than  one 
key  area  or  issue. 

We  cannot  conclude  without  mentioning  some  social  aspects  of  the  conference.  The 
reception  in  the  atmospheric  Old  Library  at  LMH  was  accompanied  by  music  from 
a  fine  woodwind  duo,  Margaret  and  Richard  Thorne,  who  had  first  met  one  of  the 
organisers  during  a  sabbatical  visit  to  Canberra,  Australia!  The  conference  dinner 
was  memorable  for  preliminary  drinks  in  the  lovely  setting  of  the  Fellows  Garden, 
excellent  food,  and  an  inspirational  after-dinner  speech  by  Professor  John  Taylor. 
Finally  the  participants  found  some  time  to  investigate  the  local  area,  including  a 
group  excursion  to  Blenheim  Palace. 

We  must  finish  by  giving  further  broad  expressions  of  thanks  to  the  many  staff  at 
Universities of Huddersfield and Brighton, US Air Force (EOARD), London Mathematical
Society, and Lady Margaret Hall who helped make the conference possible.
We  also  thank  the  publishers  for  their  co-operation  and  support  in  the  publication 
of the proceedings. Finally we must thank all the authors who contributed papers,
without whom this volume could not have existed.
INVITED PAPERS


N-TUPLE  NEURAL  NETWORKS 

N.  M.  Allinson  and  A.  R.  Kolcz 

Department  of  Electrical  Engineering  and  Electronics,  UMIST,  Manchester,  UK. 

The N-Tuple Neural Network (NTNN) is a fast, efficient memory-based neural network capable
of performing non-linear function approximation and pattern classification. The random nature
of the N-tuple sampling of the input vectors makes precise analysis difficult. Here, the NTNN
is considered within a unifying framework of the General Memory Neural Network (GMNN),
a family of networks which includes such important types as radial basis function networks. By
discussing the NTNN within such a framework, a clearer understanding of its operation and
efficient application can be gained. The nature of the intrinsic tuple distances, and the resultant
kernel, is also discussed, together with techniques for handling non-binary input patterns. An
example of a tuple-based network, which is a simple extension of the conventional NTNN, is shown
to yield the best estimate of the underlying regression function, $E(Y \mid \mathbf{x})$, for a finite training set.
Finally, the pattern classification capabilities of the NTNN are considered.

1  Introduction 

The origins of the N-tuple neural network date from 1959, when Bledsoe and Browning
[1] proposed a pattern classification system that employed random sampling of a
binary retina by taking N-bit long ordered samples (i.e., N-tuples) from the retina.
These samples form the addresses to a number of memory nodes, with each bit
in the sample corresponding to an individual address line. The N-tuple sampling
is sensitive to correlations occurring between different regions for a given class of
input patterns. Certain patterns will yield regions of the retina where the probability
of a particular state of a selected N-tuple will be very high for a pattern
class (e.g., predominantly ‘white’ or ‘black’ if we are considering binary images of
textual characters). If a set of exemplar patterns is presented to the retina, each of
the N-tuple samples can be thought of as estimating the probability of occurrence
of its individual states for each class. A cellular neural network interpretation of
N-tuple sampling was provided by Aleksander [2]; and, as we attempt to demonstrate
in this paper, its architecture conforms to what we term the General Memory
Neural Network (GMNN). Though the N-tuple neural network is more commonly
thought of as a supervised pattern classifier, we will consider first the general problem
of approximating a function, $f$, which exists in a $D$-dimensional real space,
$\mathbb{R}^D$. This function is assumed to be smooth and continuous, and we assume that we
possess a finite number of sample pairs $\{(\mathbf{x}^i, y^i) : i = 1, 2, \ldots, T\}$. We will further
assume that this training data is subject to a noise component, that is
$y^i = f(\mathbf{x}^i) + \varepsilon$, where $\varepsilon$ is a random error term with zero mean. A variant of the
NTNN for function approximation was first proposed by Tattersall et al. [3] and termed
the Single-Layer-Lookup-Perceptron (SLLUP). The essential elements of the SLLUP are the
same as the basic NTNN except that the nodal memories contain numeric weights.
A further extension of the basic NTNN, originally developed by Bledsoe and Bisson
[4], records the relative frequencies at which the various nodal memories are addressed
during training. The network introduced in Section 4 combines aspects of
these two networks and follows directly from the consolidated approach presented
in Section 2.

We discuss, in Section 3, some details of the NTNN with particular reference to its
mapping between sampled patterns on the retina and the N-tuple distance metric,
and the transformation of non-binary element vectors onto the binary retina. The
form of the first mapping, which is an approximately exponential function, is the
kernel function of the NTNN, though due to the random nature of the sampling
this must be considered in a statistical sense. Finally, a brief note is given on
a Bayesian approximation that indicates how these networks can be employed as
pattern classifiers.

2  The  General  Memory  Neural  Network 

Examples of GMNN include Radial Basis Function (RBF) [5] networks and the
General Regression Neural Network (GRNN) [6]. These networks can provide powerful
approximation capabilities and have been subject to rigorous analysis. A further
class of networks, of which the NTNN is one, has not been treated to such
detailed examination. However, these networks (together with others such as the
CMAC [7] and the Multi-Resolution Network [8]) are computationally very efficient
and better suited to hardware implementation. The essential architectural components
of GMNNs are a layer of memory nodes, arranged into a number of blocks,
and an addressing element that selects the set of locations participating in the
computation of the output response. An extended version of this section is given in
[9].

2.1  Canonical  Form  of  the  General  Memory  Neural  Network 

The  GMNN  can  be  defined  in  terms  of  the  following  elements: 

■ A set of $K$ memory nodes, each possessing a finite number, $|A_k|$, of addressable
locations.

■ An address generator which assigns an address vector

$$\mathbf{A}(\mathbf{x}) = [A_1(\mathbf{x}), A_2(\mathbf{x}), \ldots, A_K(\mathbf{x})]$$

to each input point $\mathbf{x}$. The address generated for the $k$th memory node is
denoted by $A_k(\mathbf{x})$.

■ The network’s output, $g$, is obtained by combining the contents of selected
memory locations, that is

$$[m_1(A_1(\mathbf{x})), m_2(A_2(\mathbf{x})), \ldots, m_K(A_K(\mathbf{x}))] \tag{1}$$

where $m_k(A_k(\mathbf{x}))$ is the content of the memory location selected at the $k$th
memory node by the address generated by $\mathbf{x}$ for that node (this will be
identified simply as $m_k(\mathbf{x})$). No specific format is imposed on the nature of the
memories other than that the format is uniform for all $K$ nodes.

■ A learning procedure exists which permits the network to adjust the nodal
memory contents, in response to the training set, so that some error criterion,
$r(f, g)$, is minimised.

Each node of the network performs a simple vector quantization of the input space
into $|A_k|$ cells. For each node, the address generating element can be split into
an index generator and an address decoder. The index generator, $I_k$, selects a cell
for every $\mathbf{x} \in \Omega$ and assigns it a unique index value, $I_k(\mathbf{x}) \in \{1, 2, \ldots, |A_k|\}$;
hence the index generator identifies the quantization cell to which the input point
belongs. The address decoder uses this generated index value to specify the physical
memory address which then selects the relevant node $k$. Hence, $A_k(\mathbf{x}) = A_k(I_k(\mathbf{x}))$.
Therefore, a cell, $R_i^k$, can be defined as the set of all input points which result in
the selection of an address corresponding to the $i$th index of the $k$th node.

$$R_i^k = \{\mathbf{x} \in \Omega : I_k(\mathbf{x}) = i\} \tag{2}$$

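Each of the cells is closed and bounded as the input space is compact in $\mathbb{R}^D$. The
selection of a cell is given by the following operator or activation function

$$\delta_i^k(\mathbf{x}) = (I_k(\mathbf{x}) = i) = \begin{cases} 1 & \text{if } \mathbf{x} \in R_i^k \\ 0 & \text{otherwise} \end{cases} \qquad i = 1, \ldots, |A_k| \tag{3}$$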

The quantization of $\Omega$ performed by the individual nodes is combined, through the
intersection of the $K$ cells being superimposed, to yield a global quantizer. The
number of cells, $|A|$, is given by the number of all such distinct intersections.

$$|A| = \sum_{i_1=1}^{|A_1|} \sum_{i_2=1}^{|A_2|} \cdots \sum_{i_K=1}^{|A_K|} \left( \bigcap_{k=1}^{K} \{\mathbf{x} \in \Omega : I_k(\mathbf{x}) = i_k\} \neq \emptyset \right) \tag{4}$$

The upper bound is given by $|A|_{\max} = \prod_{k=1}^{K} |A_k|$. The address generation
element of the network is distributed across the nodes, so that the general structure
of Figure 1a emerges. Alternatively, the address generation can be considered at
the global level (Figure 1b). These two variants are equivalent.

The quantization of the input space by the network produces values that are constant
over each cell (we are ignoring, for the present, external kernel functions).
The value of $f$ assigned to the $i$th cell is normally expressed as the average value
of $f$ over the cell.

^  ‘  /«./4u)du . '  (5) 

where  dj  is  given  by  the  squared  error  function.  In  most  supervised  learning  schemes, 
this representation of $f(R_i)$ is estimated inherently through the minimisation of an
error  function. 

For $K = 1$, the GMNN could simply be replaced by a vector quantizer (VQ) followed by
a look-up table. There would need to be at least one input point per quantization cell. The
granularity of the quantization needs to be sufficient to meet the degree of approximation
performance appropriate for the required task. When there are multiple
nodes ($K > 1$), the quantization at the node level can be much coarser which, in
turn, increases the probability of input points lying inside a cell, with the fine granularity
being achieved through the superposition of nodal quantizers. Learning and
generalisation are only possible through the use of multiple nodes. Points that are
close to each other in the $\Omega$ space should share many addresses, and vice versa.
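The superposition of coarse nodal quantizers can be illustrated with a small one-dimensional sketch. The layout used here (uniform grids, each shifted by a fraction of a cell width, in the style of a CMAC) is an assumption made for illustration and is not taken from the text; the function names are likewise hypothetical. The sketch counts the distinct address vectors, that is, the non-empty intersections of equation (4), found over a fine sweep of the interval [0, 1).

```python
import math


def nodal_index(x, k, num_nodes, cells_per_node):
    """Index I_k(x) of the cell that x in [0, 1) selects at node k.

    Assumed layout: node k uses a uniform grid of width 1/cells_per_node,
    shifted by k/num_nodes of one cell width (a CMAC-like arrangement).
    """
    width = 1.0 / cells_per_node
    offset = k * width / num_nodes
    return math.floor((x + offset) / width)


def address_vector(x, num_nodes, cells_per_node):
    """Tuple of nodal indices; each distinct tuple is one global cell."""
    return tuple(nodal_index(x, k, num_nodes, cells_per_node)
                 for k in range(num_nodes))


def count_global_cells(num_nodes, cells_per_node, samples=10_000):
    """Count the distinct address vectors met on a fine sweep of [0, 1)."""
    xs = ((i + 0.5) / samples for i in range(samples))
    return len({address_vector(x, num_nodes, cells_per_node) for x in xs})


if __name__ == "__main__":
    print(count_global_cells(num_nodes=1, cells_per_node=5))
    print(count_global_cells(num_nodes=4, cells_per_node=5))
```

Under this assumed layout, four nodes of width 0.2 together resolve the interval into cells of width 0.05, although each node on its own remains coarse. The count stays far below the product bound, because most combinations of nodal cells have an empty intersection.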

2.2  GMNN  Distance  Metrics 

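The address proximity function, which quantifies the proximity of points in $\Omega$ as
the number of identical nodal addresses they generate, is given by

$$\kappa : \Omega^2 \to \{0, 1, \ldots, K\}, \qquad \kappa(\mathbf{x}, \mathbf{y}) = \sum_{k=1}^{K} (A_k(\mathbf{x}) = A_k(\mathbf{y})) = \sum_{k=1}^{K} (I_k(\mathbf{x}) = I_k(\mathbf{y})) \tag{6}$$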
Figure 1 (a) GMNN: address generation considered at the nodal level. (b) GMNN:
address generation considered at the global level. These two variants are
identical in terms of overall functionality.

Figure 2 General structure of an NTNN.


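The address distance function, defined as the number of different generated nodal
addresses, is given by

$$\rho : \Omega^2 \to \{0, 1, \ldots, K\}, \qquad \rho(\mathbf{x}, \mathbf{y}) = \sum_{k=1}^{K} (A_k(\mathbf{x}) \neq A_k(\mathbf{y})) = \sum_{k=1}^{K} (I_k(\mathbf{x}) \neq I_k(\mathbf{y})) \tag{7}$$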

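The binary nodal incidence function, which returns ‘1’ if two inputs share a common
address at a given network node and ‘0’ otherwise, is defined as

$$M_k(\mathbf{x}, \mathbf{y}) = (A_k(\mathbf{x}) = A_k(\mathbf{y})) \tag{8}$$

From these definitions, several properties directly follow.

$$\forall (\mathbf{x}, \mathbf{y} \in \Omega): \quad \rho(\mathbf{x}, \mathbf{x}) = 0, \quad \rho(\mathbf{x}, \mathbf{y}) = \rho(\mathbf{y}, \mathbf{x}), \quad \kappa(\mathbf{x}, \mathbf{x}) = K, \quad \kappa(\mathbf{x}, \mathbf{y}) = \kappa(\mathbf{y}, \mathbf{x}), \quad \kappa(\mathbf{x}, \mathbf{y}) = K - \rho(\mathbf{x}, \mathbf{y}) \tag{9}$$

$$\sum_{k=1}^{K} M_k(\mathbf{x}, \mathbf{y}) = \kappa(\mathbf{x}, \mathbf{y}) \tag{10}$$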
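The three functions of this section are simple to compute once the address vectors are known. The following minimal sketch takes two address vectors as plain tuples (how they are generated is left open) and checks the stated properties numerically. The function names are illustrative only.

```python
def incidence(addr_x, addr_y):
    """M_k for every node k: 1 where both inputs select the same address."""
    return [int(a == b) for a, b in zip(addr_x, addr_y)]


def proximity(addr_x, addr_y):
    """kappa: number of nodes at which the two addresses coincide."""
    return sum(incidence(addr_x, addr_y))


def distance(addr_x, addr_y):
    """rho: number of nodes at which the two addresses differ."""
    return sum(int(a != b) for a, b in zip(addr_x, addr_y))


if __name__ == "__main__":
    ax = (3, 1, 4, 1, 5)   # address vector of an input x, K = 5 nodes
    ay = (3, 1, 4, 2, 6)   # address vector of an input y
    K = len(ax)
    print(proximity(ax, ay), distance(ax, ay))
    # kappa = K - rho, and both functions are symmetric
    assert proximity(ax, ay) == K - distance(ax, ay)
    assert proximity(ax, ay) == proximity(ay, ax)
    assert distance(ax, ax) == 0 and proximity(ax, ax) == K
```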
2.3 GMNN Error-Based Training

If  the  address  generation  elements  of  the  GMNN  have  been  established  (based  on 
some  a  priori  knowledge  about  the  function  to  be  approximated),  then  the  only 
element  which  is  modifiable  through  training  is  the  contents  of  the  nodal  memory 
locations (e.g., the weights). If these locations contain real-valued numbers and the
output  of  the  network  is  formed  by  summing  these  numbers,  then  the  response 
of  the  GMNN  is  linear  in  terms  of  this  weight  space.  Learning  methods  based  on 
the  minimisation  of  the  square  error  function  are  guaranteed  to  converge  under 
these  general  circumstances.  In  the  following  analysis,  we  will  therefore  assume  an 
iterative  LMS  learning  schedule.  For  a  finite  training  set  of  T  paired  samples,  the 
error produced by the network for the $j$th presentation of the $i$th training sample
is given by

$$\varepsilon_j^i = y^i - \sum_{k=1}^{K} w_k(\mathbf{x}^i) \tag{11}$$

where $w_k(\mathbf{x}^i)$ is the value of the weight selected by $\mathbf{x}^i$ at the $k$th node. The
participating weights are modified by a value, $\Delta_j^i$, proportional to this error so as to
reduce the error.

$$w_k(\mathbf{x}^i) \leftarrow w_k(\mathbf{x}^i) + \Delta_j^i, \qquad k = 1, 2, \ldots, K \tag{12}$$

Initially, all weight values are set to zero. As $w_k(\mathbf{x}^i)$ is shared by all points within
the neighbourhood of $\mathbf{x}^i$ at the $k$th node, this weight updating can affect the network
response for points from outside the training set, that is, within an input space region given by

$$w_k(\mathbf{x}) \leftarrow w_k(\mathbf{x}) + \Delta_j^i \cdot M_k(\mathbf{x}, \mathbf{x}^i) \tag{13}$$

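The output of the network after the training is complete, for an arbitrary $\mathbf{x} \in \Omega$,
will depend on all training samples lying within the neighbourhood of $\mathbf{x}$.

$$g(\mathbf{x}) = \sum_{k=1}^{K} w_k(\mathbf{x}) = \sum_{k=1}^{K} \sum_{i=1}^{T} M_k(\mathbf{x}, \mathbf{x}^i) \left( \sum_{j=1}^{T^i} \Delta_j^i \right) \tag{14}$$

Rearranging this expression and using the identity (10), yields

$$g(\mathbf{x}) = \sum_{i=1}^{T} \sum_{j=1}^{T^i} \sum_{k=1}^{K} \Delta_j^i \, M_k(\mathbf{x}, \mathbf{x}^i) = \sum_{i=1}^{T} \kappa(\mathbf{x}, \mathbf{x}^i) \cdot \sum_{j=1}^{T^i} \Delta_j^i \tag{15}$$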
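The training scheme of equations (11) to (15) can be checked numerically with a short sketch. Everything specific in it is an assumption chosen for illustration: a one-dimensional input, `K` shifted uniform grids as the address generator, and a fixed step size `RATE` (the text only states that the correction is proportional to the error). The sketch accumulates the corrections applied for each training sample and confirms that the trained output equals the kernel form on the right-hand side of (15).

```python
import math
from collections import defaultdict

K = 4        # number of memory nodes (illustrative)
CELLS = 5    # cells per node before shifting (illustrative)
RATE = 0.1   # LMS step size (assumed)


def addresses(x):
    """Address vector for x in [0, 1): K shifted uniform grids (assumed layout)."""
    width = 1.0 / CELLS
    return tuple(math.floor((x + k * width / K) / width) for k in range(K))


def train(samples, epochs=20):
    """Iterative LMS training; returns the weights and the summed corrections."""
    weights = [defaultdict(float) for _ in range(K)]   # all weights start at zero
    total_delta = [0.0] * len(samples)                 # sum over j of Delta_j^i
    for _ in range(epochs):
        for i, (x, y) in enumerate(samples):
            addr = addresses(x)
            error = y - sum(weights[k][addr[k]] for k in range(K))   # cf. (11)
            delta = RATE * error
            for k in range(K):
                weights[k][addr[k]] += delta                         # cf. (12)
            total_delta[i] += delta
    return weights, total_delta


def output(weights, x):
    """Network response: the sum of the K selected weights."""
    addr = addresses(x)
    return sum(weights[k].get(addr[k], 0.0) for k in range(K))


def kappa(x, y):
    """Address proximity: number of shared nodal addresses."""
    return sum(a == b for a, b in zip(addresses(x), addresses(y)))


def output_from_kernel(samples, total_delta, x):
    """Right-hand side of (15): kernel-weighted sum of accumulated corrections."""
    return sum(kappa(x, xi) * d for (xi, _), d in zip(samples, total_delta))


if __name__ == "__main__":
    data = [(0.03 + 0.1 * i, math.sin(2 * math.pi * (0.03 + 0.1 * i)))
            for i in range(10)]
    w, deltas = train(data)
    for x in (0.0, 0.12, 0.5, 0.77):
        print(x, output(w, x), output_from_kernel(data, deltas, x))
```

The two printed columns agree up to rounding, because each weight is exactly the sum of the corrections from the samples that address it.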

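We can compare this result with the response of a trained RBF network.

$$g(\mathbf{x}) = \sum_{i=1}^{T} w^i \cdot \kappa(\mathbf{x}, \mathbf{x}^i) \tag{16}$$

For the normalised RBF network, the response is given by

$$g(\mathbf{x}) = \frac{\sum_{i=1}^{T} w^i \cdot \kappa(\mathbf{x}, \mathbf{x}^i)}{\sum_{i=1}^{T} \kappa(\mathbf{x}, \mathbf{x}^i)} \tag{17}$$

and for the GRNN (where the training set response values replace the weight values)

$$g(\mathbf{x}) = \frac{\sum_{i=1}^{T} y^i \cdot \kappa(\mathbf{x}, \mathbf{x}^i)}{\sum_{i=1}^{T} \kappa(\mathbf{x}, \mathbf{x}^i)} \tag{18}$$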

Though there are obvious similarities, we can further extend the functionality of the
GMNN by incorporating into each nodal memory an integer counter. This nodal
counter is incremented whenever the node is addressed, that is $c_k(\mathbf{x}^i) \leftarrow c_k(\mathbf{x}^i) + 1$
(initially, all counter values are set to zero). Now the response of the GMNN is
given by

$$g(\mathbf{x}) = \frac{\sum_{i=1}^{T} \Delta^i \cdot \kappa(\mathbf{x}, \mathbf{x}^i)}{\sum_{i=1}^{T} \kappa(\mathbf{x}, \mathbf{x}^i)} \tag{19}$$

The trained GMNN is equivalent to a set of $T$ units, each centred at one of the training
samples, $\mathbf{x}^i$, and possessing height $\Delta^i$ and kernel weighting function $\kappa(\cdot, \mathbf{x}^i)$. To
complete the equivalence of the GMNN and the GRNN, $\kappa$ must satisfy the general
conditions imposed on kernel functions [10].

2.4  Introduction  of  External  Kernels 

The network output is given by the sum of weights corresponding to the selected
locations, but the output remains constant over each quantization cell, regardless
of the relative position of the input point inside a cell. The network mapping thus
becomes discontinuous at the cell boundaries. A solution would be to emphasise
weights that correspond to quantization cells that are closer to the current excitation,
$\mathbf{x}$, than others. This distance can be defined in terms of the distance between
$\mathbf{x}$ and the centroid, $\mathbf{c}_i^k$, of $R_i^k$ (where $i = I_k(\mathbf{x})$).

$$d(\mathbf{x}, R_i^k) = d(\mathbf{x}, \mathbf{c}_i^k) \tag{20}$$

A smooth, continuous and monotonically decreasing kernel function is then introduced
to weight the contributions of the respective nodes according to the values of $d(\mathbf{x}, R^k(\mathbf{x}))$,
where $R^k(\mathbf{x})$ is the cell selected by $\mathbf{x}$ for the $k$th node. The network output now
becomes

K  lAfcl 

ffW  =  E  E  -R-- w))  •  (21) 

*=1 1=1 

A set of $K \cdot |A_k|$ kernel or basis functions can be defined, with centres given by
the centroid set $\{\mathbf{c}_i^k\}$ and where each kernel has its support truncated to its
corresponding quantization region. The network mapping can be expressed as

$$g(\mathbf{x}) = \sum_{k=1}^{K} \sum_{i=1}^{|A_k|} w_i^k \cdot \varphi_i^k(\mathbf{x}) \tag{22}$$

or in its normalised form as

$$g(\mathbf{x}) = \frac{\sum_{k=1}^{K} \sum_{i=1}^{|A_k|} w_i^k \cdot \varphi_i^k(\mathbf{x})}{\sum_{k=1}^{K} \sum_{i=1}^{|A_k|} \varphi_i^k(\mathbf{x})} \tag{23}$$

$\varphi_i^k(\mathbf{x})$ is the kernel function associated with the $i$th quantization cell of the $k$th
node and truncated to zero at its cell boundaries. Gaussian kernels provide an
approximation to this last condition, though B-spline kernels [11] can lead to the
total absence of cell discontinuities. The introduction of external weighting kernels
is the final step in the GMNN architecture.

3 The N-Tuple Neural Network

The  general  form  of  the  NTNN  was  described  in  the  introduction  and  is  shown  in 
Figure  2.  The  following  two  sections  consider  the  mapping  functions  inherent  to 
this  network.  Namely: 

■  Conversion  of  the  input  vector  into  the  binary  format  needed  for  the  retina 

■ Sampling the retina by taking $N$ bit values at a time to form the address of one of
the $K$ memory nodes.

There is some choice in what form the first of these mappings may take depending
on the application, but the retinal N-tuple sampling is common to all NTNNs.
Figure 3a indicates how the threshold decision planes of the individual elements
of a tuple delineate the input space into discrete regions and why the Hamming
distance between tuple values is the obvious choice for a distance metric. Further
details of the N-tuple distance metric and input encoding are given in [12, 13].

3.1  Retinal  Sampling 

The relationship between the number of different addresses generated for two arbitrary
inputs, $\mathbf{x}$ and $\mathbf{y}$, and the Hamming distance $H(\mathbf{x}, \mathbf{y})$ (i.e., the number of
bits for which $\mathbf{x}$ and $\mathbf{y}$ differ) is important as it reveals the nature of the distance
metric necessary when an NTNN is used for pattern classification and the form of
the kernel for the approximation-NTNN. This relationship can only be expressed
in terms of an expectation due to the random nature of the sampling. For sampling
without repetitions, the expected value of $\rho_{\mathrm{NTNN}}(\mathbf{x}, \mathbf{y}) = \rho_{\mathrm{NTNN}}(H)$ is given by

$$E(\rho_{\mathrm{NTNN}}(H)) = K \left( 1 - \binom{R-H}{N} \Big/ \binom{R}{N} \right) \tag{24}$$

For small values of $H$, this can be simplified to

$$E(\rho_{\mathrm{NTNN}}(H)) \approx K \left( 1 - \exp\left( -\frac{N \cdot H}{R} \right) \right) \tag{25}$$

For sampling with repetitions, the expected value is

$$E(\rho_{\mathrm{NTNN}}(h)) = K \left( 1 - \exp\left( N \cdot \ln(1 - h) \right) \right) \tag{26}$$

where $h$ is the normalised Hamming distance ($= H/R$). Again, this can be simplified,
for small values of $h$, to yield

$$E(\rho_{\mathrm{NTNN}}(h)) \approx K \left( 1 - \exp(-N \cdot h) \right) \tag{27}$$
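The behaviour of the expected tuple distance can be explored with a few lines of code. The closed forms used here are derived under an explicit assumption: each of the `K` tuples samples `N` of the `R` retina bits uniformly at random, either as a subset (without repetition) or independently (with repetition), so that a nodal address changes exactly when at least one sampled bit lies among the `H` differing bits. They may differ in form from the expressions in the original paper. The function names are illustrative.

```python
import math


def expected_rho_without_repetition(K, N, R, H):
    """K * (1 - C(R-H, N) / C(R, N)): each tuple is a random N-subset of R bits."""
    return K * (1.0 - math.comb(R - H, N) / math.comb(R, N))


def expected_rho_with_repetition(K, N, R, H):
    """K * (1 - (1 - h)^N): every tuple bit is drawn independently, h = H / R."""
    h = H / R
    return K * (1.0 - (1.0 - h) ** N)


def expected_rho_exponential(K, N, R, H):
    """Small-distance approximation K * (1 - exp(-N * H / R)), cf. (27)."""
    return K * (1.0 - math.exp(-N * H / R))


if __name__ == "__main__":
    K, N, R = 32, 8, 256   # illustrative network: K * N = R
    for H in (0, 4, 16, 64, 128):
        print(H,
              round(expected_rho_without_repetition(K, N, R, H), 3),
              round(expected_rho_with_repetition(K, N, R, H), 3),
              round(expected_rho_exponential(K, N, R, H), 3))
```

For small Hamming distances the three values stay close together, and all of them saturate towards `K` as the patterns move apart. This is the roughly exponential kernel shape discussed in this section.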
Figure 3 (a) The delineation of input space by the hard decision planes of each
tuple element’s threshold. Each region is marked by its specific binary state of
the 3-tuple, $t_i$. (b) The thermometer coding inherent in N-tuple sampling. The
variable, $x_1$, is uniformly quantized into six discrete regions ($L = 6$). The indicated
2-tuple partitions this interval into three unequal quantization regions, with the
binary state of the 2-tuple indicated. (c) The delineation of the 3-dimensional input
space into tetrahedral regions through the use of a ranking code. The binary space
representation of the input space is also shown.


There  is  little  difference  in  the  general  form  of  these  two  sampling  methods,  though 
there  may  be  crucial  differences  in  performance  for  specific  tasks.  The  distance 
function depends exponentially on the ratio of the Hamming distance, $H$, between
patterns  to  the  retinal  size,  R.  The  rate  of  decrease  is  proportional  to  the  tuple 
size. 

3.2  Input  Encoding 

There is a direct and monotonic dependence of $\rho_{\mathrm{NTNN}}$ on the Hamming distance in
the binary space of the network’s retina. For binary patterns, the N-tuple sampling
provides the desired mapping between the input and output addresses. For non-binary
input patterns, the situation is not so clear. One obvious solution is to use
a thermometer or bar-chart code, where one bit is associated with every level of an
input integer. This creates a linear array of $2^n$ bits for an $n$-bit long integer. This
can  produce  very  large  retinas  if  the  input  dimensionality  and  quantization  level  are 
large.  The  use  of  the  natural  binary  code  or  Gray  code  is  not  feasible.  Though  these 
are  compact  codes,  there  is  no  monotonic  relationship  between  input  and  pattern 
distances.  The  concatenation  of  several  Gray  codes  [3]  offers  an  improvement  over 
a  limited  region  and  enhances  the  dynamic  range  over  the  binary  and  straight  Gray 
code. The exponential dependence of $\rho_{\mathrm{NTNN}}$ on the Hamming distance means
that strict proportionality is not required but monotonicity is required within an
active region of Hamming distances.
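The difference between these codes can be seen in a minimal sketch. The helper names are illustrative, and the thermometer code is taken here in its simplest form, with the first `value` bits set. With that convention, the Hamming distance between two thermometer codes equals the difference of the encoded integers, whereas the natural binary and Gray codes give distances that do not grow monotonically with that difference.

```python
def thermometer(value, levels):
    """Thermometer (bar-chart) code: the first `value` of `levels` bits are set."""
    if not 0 <= value <= levels:
        raise ValueError("value must lie between 0 and levels")
    return [1] * value + [0] * (levels - value)


def natural_binary(value, bits):
    """Natural binary code, most significant bit first."""
    return [(value >> b) & 1 for b in reversed(range(bits))]


def gray(value, bits):
    """Reflected Gray code of `value`, most significant bit first."""
    return natural_binary(value ^ (value >> 1), bits)


def hamming(a, b):
    """Number of positions in which two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    n = 3                      # 3-bit integers, so 2**n thermometer bits
    for a, b in [(0, 1), (0, 4), (3, 4), (0, 7)]:
        print(a, b,
              hamming(thermometer(a, 2 ** n), thermometer(b, 2 ** n)),
              hamming(natural_binary(a, n), natural_binary(b, n)),
              hamming(gray(a, n), gray(b, n)))
```

The price of the monotonic behaviour is the retina size: eight bits are used here for a 3-bit integer.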

The  potential  of  CMAC  encoding,  and  further  aspects  of  input  coding  methods,  are 
discussed  in  Kolcz  and  Allinson  [14].  Improvements  in  the  input  mapping,  which  in 
turn  produce  a  more  isotropic  kernel,  are  given  in  Kolcz  and  Allinson  [15],  where 
rotation  and  hyper-sphere  codings  are  described.  Two  further  techniques  will  be 
briefly  introduced  here  in  order  to  indicate  the  wide  range  of  possible  sampling  and 
coding schemes. Figure 3b shows one input variable, $x_1$, which is uniformly quantized
to six levels and this is sampled by the indicated 2-tuple. The corresponding
states of the resultant tuple for the three resulting sub-intervals indicate that thermometer
encoding can be inherent in tuple sampling. This concept can be extended
to the multivariate case. If the input space, $\Omega$, is a $D$-dimensional hypercube and
each memory node distributes its $N$ address lines among these dimensions, then
the space is effectively quantized into $\prod_{d=1}^{D} (N_d + 1)$ hyper-rectangular cells. This
assumes random sampling, such that there are $N_d$ address lines per input dimension
(where $N_d = N/D$). The placement of tuples can be very flexible (i.e., uniform
quantization  is  not  essential)  and  the  sampling  process  can  take  into  account  the 
density  of  training  points  within  the  input  space. 

In rank-order coding, the non-binary N-tuple is transformed to a tuple of ranked
values (e.g., (20, 63, 40, 84, 122, 38) becomes, in ascending order, the ranked tuple
(0, 3, 2, 4, 5, 1)). Each possible ordering is assigned a unique consecutive ranking
number, which is converted to binary format and then used as the retinal input.
Rank-order coding produces an equal significance code. The use of these relationships
is equivalent to delineating the input space into hyper-tetrahedrons rather
than  the  usual  hyper-rectangles  (Figure  3c). 
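Rank-order coding can be sketched as follows. The text does not say how the consecutive ranking number is assigned to each ordering, so the lexicographic numbering of permutations used here is an assumption, as are the function names.

```python
import math


def rank_tuple(values):
    """Rank of each element in ascending order (0 = smallest value)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, i in enumerate(order):
        ranks[i] = rank
    return tuple(ranks)


def permutation_index(ranks):
    """Lexicographic index of a permutation, from 0 to n! - 1 (assumed numbering)."""
    n = len(ranks)
    index = 0
    for i in range(n):
        smaller = sum(1 for j in range(i + 1, n) if ranks[j] < ranks[i])
        index += smaller * math.factorial(n - 1 - i)
    return index


def rank_code(values):
    """Binary string for the ordering of `values`, wide enough for all n! orderings."""
    n = len(values)
    bits = max(1, (math.factorial(n) - 1).bit_length())
    return format(permutation_index(rank_tuple(values)), f"0{bits}b")


if __name__ == "__main__":
    sample = (20, 63, 40, 84, 122, 38)
    print(rank_tuple(sample))   # the ranked tuple quoted in the text
    print(rank_code(sample))    # 10 bits suffice for 6! = 720 orderings
```

Only the relative order of the tuple elements matters, so any monotonic rescaling of the inputs leaves the code unchanged.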

4 N-tuple Regression Network

The framework for GMNN proposed earlier, together with the derivation of the tuple
distance metric, is employed here in the development of a modified NTNN which
operates as a non-parametric regression estimator. The formal derivation of this
network, and the demonstration that the N-tuple kernel is a valid one for estimating
the underlying probability function, are given in Kolcz and Allinson [16]. The purpose
of this section is to show the relative simplicity of this network compared with other
implementations.

During the training phase, the network is presented with $T$ data pairs, $(\mathbf{x}^i, y^i)$,
where $\mathbf{x}^i$ is the $D$-dimensional input vector and $y^i$ is the corresponding scalar
output of the system under consideration. The input vector is represented by a
unique selection of the $K$ tuple addresses with their associated weight and counter
values.

$$\mathbf{x} \to \begin{cases} \{t_1(\mathbf{x}), t_2(\mathbf{x}), \ldots, t_K(\mathbf{x})\} \\ \{w_1(\mathbf{x}), w_2(\mathbf{x}), \ldots, w_K(\mathbf{x})\} \\ \{a_1(\mathbf{x}), a_2(\mathbf{x}), \ldots, a_K(\mathbf{x})\} \end{cases} \tag{28}$$

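During training, each addressed tuple location is updated according to

$$w_k(\mathbf{x}^i) \leftarrow w_k(\mathbf{x}^i) + y^i \quad \text{and} \quad a_k(\mathbf{x}^i) \leftarrow a_k(\mathbf{x}^i) + 1 \tag{29}$$

for $i = 1, 2, \ldots, T$ and $k = 1, 2, \ldots, K$.

Initially all weight and counter values are set to zero. After training, the network
output, $\hat{y}(\mathbf{x})$, is obtained from

$$\hat{y}(\mathbf{x}) = \frac{\sum_{k=1}^{K} w_k(\mathbf{x})}{\sum_{k=1}^{K} a_k(\mathbf{x})} \tag{30}$$

An additional condition is where all addressed locations are zero. In this case, the
output is set to zero.

$$\sum_{k=1}^{K} a_k(\mathbf{x}) = 0 \;\rightarrow\; \hat{y}(\mathbf{x}) = 0 \tag{31}$$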
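The training and recall rules above fit in a few lines of code. The following is a minimal sketch and not the authors' implementation: the class name, the use of dictionaries as nodal memories, and the choice of sampling each tuple as a random subset of retina bits are all assumptions. The input is taken to be an already encoded binary retina.

```python
import random
from collections import defaultdict


class NTupleRegressionNetwork:
    """Sketch of an N-tuple regression network over a binary retina."""

    def __init__(self, retina_size, tuple_size, num_nodes, seed=0):
        rng = random.Random(seed)
        # Each node samples `tuple_size` distinct retina bits (fixed after creation).
        self.tuples = [rng.sample(range(retina_size), tuple_size)
                       for _ in range(num_nodes)]
        self.weights = [defaultdict(float) for _ in range(num_nodes)]
        self.counters = [defaultdict(int) for _ in range(num_nodes)]

    def addresses(self, retina):
        """One address per node: the retina bits picked out by that node's tuple."""
        return [tuple(retina[b] for b in bits) for bits in self.tuples]

    def train(self, retina, y):
        """Add y to every addressed weight and increment its counter, cf. (29)."""
        for k, addr in enumerate(self.addresses(retina)):
            self.weights[k][addr] += y
            self.counters[k][addr] += 1

    def predict(self, retina):
        """Ratio of summed weights to summed counters, cf. (30) and (31)."""
        addrs = self.addresses(retina)
        total_a = sum(self.counters[k].get(a, 0) for k, a in enumerate(addrs))
        if total_a == 0:
            return 0.0          # no addressed location was ever trained
        total_w = sum(self.weights[k].get(a, 0.0) for k, a in enumerate(addrs))
        return total_w / total_a


if __name__ == "__main__":
    net = NTupleRegressionNetwork(retina_size=16, tuple_size=4, num_nodes=8)
    net.train([0] * 16, 1.0)
    net.train([1] * 16, 3.0)
    print(net.predict([0] * 16), net.predict([0] * 8 + [1] * 8))
```

Recall touches exactly `num_nodes` memory locations whatever the size of the training set, which is the fixed-cost property highlighted later in this section.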

Figure 4 shows the modifications needed to a conventional NTNN to form the N-tuple
regression network.

Figure 4 Modifications to the nodes and output elements of the NTNN to yield
the N-tuple regression network.

By considering the tuple distances between inputs, as
defined in terms of the number of different tuple addresses generated, (30) can
be extended to

$$\hat{y}(\mathbf{x}) = \frac{\sum_{k=1}^{K} w_k(\mathbf{x})}{\sum_{k=1}^{K} a_k(\mathbf{x})} = \frac{\sum_{i=1}^{T} y^i \left( 1 - \frac{\rho(\mathbf{x}, \mathbf{x}^i)}{K} \right)}{\sum_{i=1}^{T} \left( 1 - \frac{\rho(\mathbf{x}, \mathbf{x}^i)}{K} \right)} \tag{32}$$

This suggests that the network output is an approximate solution of the generalised
regression function, $E(y|x)$, provided that the bracketed term in (32) is a
valid kernel function. This function is continuous, symmetrical, non-negative and
possesses finite support. These are all necessary conditions. A close approximation
(based on the exponential approximation of tuple distances) is also representable
as a product of univariate kernel functions. Taken together, these provide sufficient
conditions for a valid kernel function [17]. A wide-ranging set of experiments on
chaotic time-series prediction and non-linear system modelling has been conducted
[16], which confirms the successful operation of this network. A major advantage of
the NTNN implementation over other approaches is its fast, and fixed, speed of
operation: each recall operation involves addressing a fixed number of locations.
There is no need for preprocessing large data sets through data clustering, as is
often the case for RBF networks [18].
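The step from the lookup form (30) to the kernel form (32) can be checked numerically. The sketch below assumes a hypothetical address encoding (random components compared with random thresholds) and synthetic data; what it illustrates is only the identity itself: a training point that shares $K - \rho$ addresses with the query contributes its output $K - \rho$ times to the summed weights.

```python
import numpy as np


def tuple_addresses(X, idx, thr):
    """Integer address of every tuple for every row of X; result shape (T, K)."""
    bits = (X[:, idx] > thr).astype(np.int64)        # (T, K, N) bit patterns
    return bits @ (1 << np.arange(idx.shape[1]))     # pack N bits per tuple


rng = np.random.default_rng(0)
T, D, K, N = 200, 3, 16, 4                 # hypothetical problem sizes
idx = rng.integers(0, D, size=(K, N))      # assumed random input sampling
thr = rng.random((K, N))                   # assumed random thresholds
X = rng.random((T, D))
y = np.sin(2 * np.pi * X[:, 0]) + X[:, 1]  # synthetic target values
x_query = rng.random((1, D))

match = tuple_addresses(X, idx, thr) == tuple_addresses(x_query, idx, thr)

# Lookup form (30): summed weights over summed counters at the K addresses.
den = match.sum()
y_lookup = (match * y[:, None]).sum() / den if den > 0 else 0.0

# Kernel form (32): tuple distance rho = number of differing addresses.
rho = K - match.sum(axis=1)
kernel = 1.0 - rho / K
y_kernel = (kernel @ y) / kernel.sum() if kernel.sum() > 0 else 0.0
```

Both `y_lookup` and `y_kernel` evaluate the same ratio, so they agree up to rounding.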


5  Pattern  Classification 

So  far  we  have  restricted  our  considerations  to  the  approximation  properties  of  the 
NTNN,  but  the  other  major  application  —  namely,  pattern  classification  —  can 
be  discussed  within  this  common  framework.  The  training  phase  of  a  supervised 
network  provides  estimates  of  the  conditional  probabilities  of  individual  pattern 
classes.  The  class  membership  probabilities  can  be  formulated  through  the  Bayes 
relationship,  i.e., 

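$$P(c \mid x) = \frac{p(x \mid c)\,P(c)}{p(x)} \tag{33}$$

where c is the class label for a particular class (c = 1, 2, ..., C). The modified
NTNN discussed in Section 4 can be reformulated in terms of this classification.
The network through training approximates C indicator functions, which denote
membership to an individual class.

$$f_c(x) = \begin{cases} 1 & \text{if } x \text{ belongs to class } c \\ 0 & \text{otherwise} \end{cases} \tag{34}$$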

Modifying  (32),  the  indicator  functions  can  be  approximated,  after  training,  by 


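$$\hat{f}_c(x) = \frac{\sum_{i \in c} \left(1 - \rho(x, x^i)/K\right)}{\sum_{i=1}^{T} \left(1 - \rho(x, x^i)/K\right)} \approx \frac{p(x \mid c)\,P(c)}{p(x)} \tag{35}$$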


This relationship gives the ratio of the cumulative summation of all training points
belonging to a class c, which have an N-tuple distance of 0, 1, ..., (K - 1) from
x, to a similar cumulative summation for all training points. The decision surfaces
present in the K-dimensional weight space are described by $w_k = \mathrm{const}$, and
the winning class is given by $c_{\mathrm{winner}} = \sum_{k=1}^{K} \ldots$

6  Conclusions 

The  unifying  approach  proposed  for  a  wide  class  of  memory-based  neural  networks 
means  that  practical,  but  poorly  understood,  networks  (such  as  the  NTNN)  can 
be  considered  in  direct  comparison  with  networks  (such  as  RBF  networks)  that 
possess a much firmer theoretical foundation. The random sampling inherent in
the N-tuple approach makes detailed analysis difficult, so this link is all the more
important. The pragmatic advantages of NTNNs have been demonstrated in the
regression network described above, where large data-sets can be accommodated
with fixed computational overheads. The possible range of input sampling and
encoding  strategies  has  been  illustrated,  but  by  no  means  exhaustively.  There  is 
still  a  need  to  seek  other  strategies  that  will  provide  optimum  kernel  functions  for 
specified  recognition  or  approximation  tasks.  The  power  and  flexibility  of  Bledsoe 
and Browning’s original concept has not, as yet, been fully exploited.

The set of all the neural networks of a fixed architecture forms a geometrical manifold where
the modifiable connection weights play the role of coordinates. It is important to study all such
networks as a whole, rather than the behavior of each network, in order to understand the capability
of information processing of neural networks. What is the natural geometry to be introduced in
the manifold of neural networks? Information geometry gives an answer, giving the Riemannian
metric and a dual pair of affine connections. An overview of the information geometry of
neural networks is given.

1  Introduction  to  Neural  Manifolds 

Let us consider a neural network of fixed architecture specified by parameters
$w = (w_1, \ldots, w_p)$ which represent the connection weights and thresholds of the network.
The parameters are usually modifiable by learning. The set N of all such networks
is considered a p-dimensional neural manifold, where w is a coordinate system in
N. Because it includes all the possible networks belonging to that architecture, the
total capabilities of the networks are made clear by studying the manifold N itself.
To  be  specific,  let  N  be  the  set  of  multilayer  feedforward  networks  each  of  which 
receives  an  input  x  and  emits  an  output  z.  The  input-output  relation  is  described 
as 

$$z = f(x; w)$$


where  the  total  output  z  depends  on  w  which  describes  all  the  connection  weights 
and  thresholds  of  the  hidden  and  output  neurons.  Let  us  consider  the  space  S  of 
all  the  square  integrable  functions  of  x 

$$S = \{k(x)\}$$

and assume for the moment that $f(x; w)$ is square integrable. The set S is the
infinite-dimensional $L_2$ space, and the neural manifold is a part of it, that is, a
p-dimensional subspace embedded in S. This shows that not all the functions are
realizable by neural networks.

Given a function $k(x)$, we would like to find a neural network whose behavior
$f(x; w)$ approximates $k(x)$ as well as possible. The best approximation is given by
projecting $k(x)$ to N in the entire space S. The approximation power depends on
the shape of N in S. This shows that geometrical considerations are important.

When the behavior of a network is stochastic, it is given by the conditional
probability $p(z|x; w)$ of z conditioned on input x, where w denotes the network
parameters. A typical example is a network composed of binary stochastic neurons:
the probability of z = 1 for such a stochastic neuron is given by


$$\mathrm{Prob}\{z = 1; w\} = \frac{\exp\{w \cdot x\}}{1 + \exp\{w \cdot x\}}, \tag{1}$$

where z = 0 or 1. Another typical case is a noise-contaminated network whose
output z is written as

$$z = f(x; w) + n, \tag{2}$$

where n is a random noise independent of w. If n is subject to the normal
distribution with mean 0 and covariance I, I being the identity matrix,
the conditional probability is given by

$$p(z|x; w) = \mathrm{const} \cdot \exp\left\{-\frac{1}{2}\,|z - f(x; w)|^2\right\}. \tag{3}$$

When input signals x are produced independently from a distribution $q(x)$, the
joint distribution of $(x, z)$ is given by

$$p(x, z; w) = q(x)\,p(z|x; w). \tag{4}$$

Let S be the set of all the conditional probability distributions (or joint
distributions). Let $q(z|x)$ be an arbitrary probability distribution in S which is to be
approximated by a stochastic neural network of the behavior $p(z|x; w)$. The neural
manifold N consists of all the conditional probability distributions $p(z|x; w)$ (or
the joint probability distributions $p(x, z; w)$) and is a p-dimensional submanifold of
S. A fundamental question arises: What is the natural geometry of S and N? How
should the distance between two distributions be measured? What is the geodesic
connecting two distributions? It is important to have a definite answer to these
problems, not only for studying stochastic networks but also for deterministic
networks, which are not free of random noise and whose stochastic interpretation is
sometimes very useful. Information geometry ([3], [17]) answers all of these
problems.


2  A  Short  Review  of  Information  Geometry 

Let us consider probability distributions $p(y, \xi)$ of a random variable y, where
$\xi = (\xi_1, \ldots, \xi_n)$ are the parameters that specify a distribution. When y is a scalar and
normally distributed, we have

$$p(y, \xi) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{-\frac{(y - \mu)^2}{2\sigma^2}\right\},$$

where $\xi = (\mu, \sigma)$ are the parameters that specify it. When y is discrete, taking values
on $\{0, 1, \ldots, n\}$, we have

$$p(y, \xi) = \sum_{i=1}^{n} \xi_i\,\delta_i(y) + \left(1 - \sum_{i=1}^{n} \xi_i\right)\delta_0(y),$$

where $\delta_i(y) = 1$ when $y = i$ and is equal to 0 otherwise, and $\xi_i = \mathrm{Prob}\{y = i\}$.
Let S be the set of such distributions

$$S = \{p(y, \xi)\}$$

specified by $\xi$. Then, S can be regarded as an n-dimensional manifold, where $\xi$ is
a coordinate system. If we can introduce a distance measure between two nearby
points specified by $\xi$ and $\xi + d\xi$ by the quadratic form

$$ds^2 = \sum_{i,j} g_{ij}(\xi)\,d\xi_i\,d\xi_j, \tag{5}$$

the manifold S is said to be Riemannian. The quantity $\{g_{ij}(\xi)\}$ is the Riemannian
metric tensor. Another important concept is the affine connection by which
a geodesic line (an extension of the concept of the straight line in the Euclidean




Amari:  Information  Geometry  of  Neural  Networks 

geometry) is defined. The affine connection is defined by the covariant derivative,
and is represented by the three-index parameters $\Gamma_{ijk}(\xi)$ formally defined by

$$\Gamma_{ijk} = \langle \nabla_{e_i} e_j, e_k \rangle \tag{6}$$

where $e_i = \partial/\partial\xi_i$ is the natural basis vector field, $\langle \cdot, \cdot \rangle$ is the inner product and $\nabla$
is the covariant derivative. The metric tensor is written as

$$\langle e_i, e_j \rangle = g_{ij}. \tag{7}$$

In Riemannian geometry, the Levi-Civita connection

$$\Gamma^{(0)}_{ijk} = \frac{1}{2}\left(\frac{\partial}{\partial\xi_i} g_{jk} + \frac{\partial}{\partial\xi_j} g_{ik} - \frac{\partial}{\partial\xi_k} g_{ij}\right) \tag{8}$$

is usually used. This is the only torsion-free metric connection satisfying

$$X\langle Y, Z \rangle = \langle \nabla_X Y, Z \rangle + \langle Y, \nabla_X Z \rangle, \tag{9}$$

for vector fields X, Y and Z. More intuitively, a geodesic is the minimum distance
curve connecting two points under the Levi-Civita connection.

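Information geometry ([3], [17]) defines a new pair of affine connections given,
respectively, by

$$\Gamma^{(e)}_{ijk}(\xi) = \Gamma^{(0)}_{ijk} - \frac{1}{2}\,T_{ijk}, \tag{10}$$

$$\Gamma^{(m)}_{ijk}(\xi) = \Gamma^{(0)}_{ijk} + \frac{1}{2}\,T_{ijk}, \tag{11}$$

where $T_{ijk}$ is the tensor defined by

$$T_{ijk} = E\left[\frac{\partial}{\partial\xi_i}\log p\;\frac{\partial}{\partial\xi_j}\log p\;\frac{\partial}{\partial\xi_k}\log p\right]. \tag{12}$$

They are called the e- and m-connection, respectively, and are dual in the sense of

$$X\langle Y, Z \rangle = \langle \nabla^{(e)}_X Y, Z \rangle + \langle Y, \nabla^{(m)}_X Z \rangle. \tag{13}$$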

The duality of connections is a new concept introduced by information geometry.
When a manifold has a dually flat structure, in the sense that the e- and
m-Riemann-Christoffel curvatures vanish (but the Levi-Civita connection has
non-vanishing curvature in general), it has remarkable properties that are dual
extensions of Euclidean properties. Exponential families of probability distributions
are proved to be dually flat. Intuitively speaking, the set of all the probability
distributions $\{p(y)\}$ is dually flat, and any parameterized model $S = \{p(y, \xi)\}$ is a
curved submanifold embedded in it. Hence, it is important to study properties of
the dually flat manifold.

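When S is dually flat, the divergence function

$$D(\xi : \xi') = \int p(y, \xi) \log \frac{p(y, \xi)}{p(y, \xi')}\,dy \tag{14}$$

is naturally and automatically introduced to S. This is an extension of the square
of the Riemannian distance, because it satisfies

$$D(\xi : \xi + d\xi) = \frac{1}{2} \sum_{i,j} g_{ij}(\xi)\,d\xi_i\,d\xi_j. \tag{15}$$

The divergence satisfies $D(\xi : \xi') \geq 0$, with equality when and only when $\xi = \xi'$.
But it is not symmetric in general.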
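The three properties just stated (non-negativity, asymmetry, and the local quadratic form (15)) can be illustrated on the simplest family, the Bernoulli distribution with coordinate p. This is a small sketch with arbitrarily chosen numbers; the function names are hypothetical, and the Fisher information $1/\{p(1-p)\}$ of the Bernoulli family is a standard result rather than something taken from the text.

```python
import math


def kl_bernoulli(p, q):
    """Divergence D(p : q) of (14) between two Bernoulli distributions."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def fisher_bernoulli(p):
    """Fisher information of the Bernoulli family in the coordinate p."""
    return 1.0 / (p * (1.0 - p))


p, d = 0.3, 1e-3
local_divergence = kl_bernoulli(p, p + d)          # left side of (15)
quadratic_form = 0.5 * fisher_bernoulli(p) * d**2  # right side of (15)

forward = kl_bernoulli(0.2, 0.6)    # D(xi : xi')
backward = kl_bernoulli(0.6, 0.2)   # D(xi' : xi), a different number
```

The two sides of (15) agree up to terms of third order in the displacement, while `forward` and `backward` differ, which is why the divergence is not a distance in the usual sense.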

We have the generalized Pythagorean theorem.

Theorem 1 Let $\xi_1$, $\xi_2$ and $\xi_3$ be three distributions in a dually flat manifold. Then,
when the m-geodesic connecting $\xi_1$ and $\xi_2$ is orthogonal at $\xi_2$ to the e-geodesic
connecting $\xi_2$ and $\xi_3$,

$$D(\xi_1 : \xi_2) + D(\xi_2 : \xi_3) = D(\xi_1 : \xi_3). \tag{16}$$

The projection theorem is a consequence of this theorem.

Theorem 2 Let M be a submanifold embedded in a dually flat manifold S. Let P
be a point in S, and let $Q_P$ be the point in M that minimizes $D(P : Q)$, $Q \in M$,
that is, the point in M that gives the best approximation to P. Then $Q_P$ is the
m-geodesic projection of P to M.
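Theorems 1 and 2 can be illustrated on a small discrete example. The sketch below makes its own choices, none of which come from the text: S is the set of joint distributions of two binary variables, and M is the set of product (independent) distributions, which is an exponential family and hence e-flat. For this M, the minimizer of $D(q : p)$ over $p \in M$ is the product of the marginals of q, and relation (16) then holds exactly for every other point r of M.

```python
import numpy as np


def kl(p, q):
    """Divergence D(p : q) between two strictly positive discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))


# A point of S: an arbitrary joint distribution of two binary variables.
q = np.array([[0.40, 0.10],
              [0.15, 0.35]])

# Its m-projection onto M: the product of the two marginals of q.
p_star = np.outer(q.sum(axis=1), q.sum(axis=0))

# Any other point of M, here an arbitrary product distribution.
r = np.outer([0.7, 0.3], [0.2, 0.8])

lhs = kl(q, p_star) + kl(p_star, r)   # D(xi_1 : xi_2) + D(xi_2 : xi_3)
rhs = kl(q, r)                        # D(xi_1 : xi_3)
```

Because `kl(p_star, r)` is non-negative, the identity also shows that `p_star` is at least as close to `q` as any other member of M, which is the content of the projection theorem.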


These properties are applied to various fields of information sciences such as
statistics ([3], [11], [16]), information theory ([5], [9]), control systems theory ([4], [19]),
dynamical systems ([14], [18]), etc. They are also useful for neural networks ([6], [10], [7]).
We will show how they are applied to neural networks.


3  Manifold  of  Feedforward  Networks  and  EM  Algorithm 

In the beginning, we show a very simple case of the manifold M of simple stochastic
perceptrons or single stochastic neurons. Such a perceptron receives input x and emits
a binary output z stochastically, based on the weighted sum $w \cdot x$ of input x. The
conditional probability of z is written as

$$p(z|x; w) = \frac{e^{z\,w \cdot x}}{1 + e^{w \cdot x}}. \tag{17}$$

The joint distribution is given by

$$p(x, z; w) = q(x)\,p(z|x; w). \tag{18}$$

Here, we assume that $q(x)$ is the normal distribution $N(0, I)$ with mean 0 and
covariance matrix I, the identity matrix.

Let M be the set of all the stochastic perceptrons. Since a perceptron is specified
by a vector w, M is regarded as a manifold homeomorphic to $\mathbb{R}^n$, where n is the
dimension of x and w. Here, w is a coordinate system of M. We introduce a
natural Riemannian geometric structure to M. Then, it is possible to define the
distance between two perceptrons, the volume of M itself, and so on. This is done
through the Fisher information, since each point (perceptron) in M is regarded as
a probability distribution (18). Let $G(w) = (g_{ij}(w))$ be a matrix representing the
Riemannian metric tensor at point w. We define the Riemannian metric by the
Fisher information matrix,


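$$g_{ij}(w) = E\left[\frac{\partial}{\partial w_i} \log p(z, x; w)\;\frac{\partial}{\partial w_j} \log p(z, x; w)\right], \tag{19}$$

where E denotes the expectation over $(z, x)$ with respect to the distribution
$p(z, x; w)$. In order to calculate the metric G explicitly, let $e_w$ be the unit column
vector in the direction of w in the Euclidean space $\mathbb{R}^n$,

$$e_w = \frac{w}{|w|},$$

where $|w|$ is the Euclidean norm, and $e_w^T$ its transpose. We then have the following
theorem.

Theorem 3 The Fisher information metric is given by

$$G(\mathbf{w}) = c_1(w)\,I + \{c_2(w) - c_1(w)\}\,e_w e_w^T, \tag{20}$$

where $w = |\mathbf{w}|$ is the Euclidean norm of the weight vector $\mathbf{w}$, and $c_1(w)$ and $c_2(w)$
are given by

$$c_1(w) = \frac{1}{\sqrt{2\pi}} \int f(w\varepsilon)\{1 - f(w\varepsilon)\} \exp\left\{-\frac{1}{2}\varepsilon^2\right\} d\varepsilon, \tag{21}$$

$$c_2(w) = \frac{1}{\sqrt{2\pi}} \int f(w\varepsilon)\{1 - f(w\varepsilon)\}\,\varepsilon^2 \exp\left\{-\frac{1}{2}\varepsilon^2\right\} d\varepsilon, \tag{22}$$

with $f(u) = e^u/(1 + e^u)$ the output probability appearing in (17).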

The theorem shows that M is a Riemannian manifold with Riemannian metric
$G(w)$ which has spherical symmetry. It is remarked that the skewness tensor
$T = (T_{ijk})$ vanishes under the distribution $q(x) \sim N(0, I)$. Hence, M is a self-dual
Riemannian manifold and no dual affine connections appear in this special case.
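Theorem 3 can be checked numerically in two dimensions. Differentiating the logarithm of (17) gives $\partial \log p / \partial w_i = \{z - f(w \cdot x)\}\,x_i$, so the Fisher matrix (19) equals $E[f(1-f)\,x x^T]$ with $x \sim N(0, I)$. The sketch below compares that expectation, evaluated by tensor-product Gauss-Hermite quadrature, with the closed form (20). The function names, the quadrature order and the test vector are illustrative assumptions.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss


def logistic(u):
    return 1.0 / (1.0 + np.exp(-u))


def c1_c2(norm_w, deg=80):
    """Quadrature estimates of c1 and c2 in (21) and (22)."""
    eps, wts = hermegauss(deg)          # weight function exp(-eps^2 / 2)
    wts = wts / np.sqrt(2.0 * np.pi)    # normalise to the N(0, 1) density
    v = logistic(norm_w * eps) * (1.0 - logistic(norm_w * eps))
    return float(wts @ v), float(wts @ (v * eps**2))


def fisher_closed_form(w):
    """Right-hand side of (20); w must be a non-zero vector."""
    w = np.asarray(w, float)
    norm_w = np.linalg.norm(w)
    e = w / norm_w
    c1, c2 = c1_c2(norm_w)
    return c1 * np.eye(len(w)) + (c2 - c1) * np.outer(e, e)


def fisher_direct_2d(w, deg=80):
    """E[f(1 - f) x x^T] for x ~ N(0, I_2), by tensor-product quadrature."""
    w = np.asarray(w, float)
    nodes, wts = hermegauss(deg)
    wts = wts / np.sqrt(2.0 * np.pi)
    x1, x2 = np.meshgrid(nodes, nodes, indexing="ij")
    f = logistic(w[0] * x1 + w[1] * x2)
    v = np.outer(wts, wts) * f * (1.0 - f)
    return np.array([[np.sum(v * x1 * x1), np.sum(v * x1 * x2)],
                     [np.sum(v * x2 * x1), np.sum(v * x2 * x2)]])
```

For instance, `fisher_closed_form([0.9, -1.2])` and `fisher_direct_2d([0.9, -1.2])` should agree to within the quadrature error.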

We  now  show  some  applications  of  the  Riemannian  structure. 

The volume $V_n$ of the perceptron manifold M is measured by

$$V_n = \int \sqrt{|G(w)|}\,dw, \tag{23}$$

where $|G(w)|$ is the determinant of $G = (g_{ij})$, so that $\sqrt{|G(w)|}$ represents the volume
density induced by the Riemannian metric. Let us consider the problem of estimating
w from examples. From the Bayesian viewpoint, we consider that w is chosen randomly
subject to a prior distribution $p_{pr}(w)$. A natural choice of $p_{pr}(w)$ is the Jeffreys prior
or non-informative prior given by

$$p_{pr}(w) = \frac{1}{V_n} \sqrt{|G(w)|}. \tag{24}$$

When one studies learning curves and overtraining ([8]), it is important to take the
effect of $p_{pr}(w)$ into account. The Jeffreys prior is calculated as follows.

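Theorem 4 The Jeffreys prior and the volume of the manifold are given by

$$p_{pr}(w) = \frac{1}{V_n} \sqrt{|G(w)|} = \frac{1}{V_n} \sqrt{c_2(w)\{c_1(w)\}^{n-1}}, \tag{25}$$

$$V_n = \int \sqrt{c_2(w)\{c_1(w)\}^{n-1}}\;a_n w^{n-1}\,dw, \tag{26}$$

respectively, where $a_n$ is the surface area (the $(n-1)$-dimensional volume) of the
unit sphere in $\mathbb{R}^n$.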
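The determinant in (25) and the radial integral in (26) follow from the structure of (20) in two short steps. The derivation below is a sketch added for the reader; it uses only (20) and the fact that $e_w$ is a unit vector, and writes r for the radial variable $|w|$.

```latex
% e_w is a unit vector, so e_w e_w^T projects onto the direction of w.
\begin{align*}
G(w)\,e_w &= c_1(w)\,e_w + \{c_2(w) - c_1(w)\}\,e_w = c_2(w)\,e_w, \\
G(w)\,v   &= c_1(w)\,v \qquad \text{for every } v \perp e_w, \\
|G(w)|    &= c_2(w)\,\{c_1(w)\}^{n-1}.
\end{align*}
% G depends on w only through the radius r = |w|, so in polar coordinates
% dw = a_n r^{n-1} dr, with a_n the surface measure of the unit sphere in R^n:
\[
V_n = \int_0^\infty \sqrt{c_2(r)\,\{c_1(r)\}^{n-1}}\; a_n\, r^{n-1}\, dr .
\]
```

The metric therefore has one eigenvalue $c_2$ along the weight direction and $n-1$ eigenvalues $c_1$ across it, which is the spherical symmetry noted after Theorem 3.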

Gradient descent is a well known learning method, which was proposed by
Widrow for the analog linear perceptron and extended to non-linear multilayer
networks with hidden units ([2], [20] and others). Let $l(x, z; w)$ be the loss function
when the perceptron with weight w processes an input-output pair $(x, z)$. In many
cases, the squared error

$$l(x, z; w) = \frac{1}{2}\,|z - f(w \cdot x)|^2$$

is used. When input-output examples $(x_t, z_t)$, $t = 1, 2, \ldots$, are given, we train the
perceptron by the gradient-descent method:

$$w_{t+1} = w_t - c\,\frac{\partial l(x_t, z_t; w_t)}{\partial w}, \tag{27}$$

where $w_t$ is the current value of the weight and it is modified to give $w_{t+1}$ by using
the current input-output pair $(x_t, z_t)$. It was shown that the learning constant c can be
replaced by any positive-definite matrix cK ([2]). The natural choice of this matrix
is the inverse of the Fisher information metric,

$$w_{t+1} = w_t - c\,G^{-1}(w_t)\,\frac{\partial l}{\partial w}, \tag{28}$$

from the geometrical point of view, since this is the invariant gradient under general
coordinate transformations. However, it is in general not easy to calculate $G^{-1}$, so
this excellent scheme is not usually used (see [10]). We can calculate $G^{-1}(w)$
explicitly in the perceptron case.


Theorem 5 The inverse of the Fisher information metric is

$$G^{-1}(w) = \frac{1}{c_1(w)}\,I + \left(\frac{1}{c_2(w)} - \frac{1}{c_1(w)}\right) e_w e_w^T. \tag{29}$$


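This leads to a very simple form of learning. Substituting the squared error and
the inverse metric of Theorem 5 into (28) gives

$$w_{t+1} = w_t + c\,\{z_t - f(w_t \cdot x_t)\}\,f'(w_t \cdot x_t) \left[\frac{1}{c_1(w_t)}\,x_t + \left(\frac{1}{c_2(w_t)} - \frac{1}{c_1(w_t)}\right)(e_w \cdot x_t)\,e_w\right]. \tag{30}$$

This is much more efficient than the prevailing simple gradient method.
Any acceleration method can be incorporated with this.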
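A single natural-gradient step for the stochastic perceptron can be sketched as follows. The code assumes that f is the logistic function of (17), so that $f' = f(1 - f)$, and evaluates $c_1$ and $c_2$ by Gauss-Hermite quadrature; the function names, the quadrature order and the learning constant are illustrative choices, not part of the text.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss


def logistic(u):
    return 1.0 / (1.0 + np.exp(-u))


def c1_c2(norm_w, deg=80):
    """Quadrature estimates of c1 and c2 of Theorem 3."""
    eps, wts = hermegauss(deg)          # weight function exp(-eps^2 / 2)
    wts = wts / np.sqrt(2.0 * np.pi)    # normalise to the N(0, 1) density
    v = logistic(norm_w * eps) * (1.0 - logistic(norm_w * eps))
    return float(wts @ v), float(wts @ (v * eps**2))


def fisher_and_inverse(w):
    """Metric of Theorem 3 and its inverse from Theorem 5 (w must be non-zero)."""
    norm_w = np.linalg.norm(w)
    e = w / norm_w
    c1, c2 = c1_c2(norm_w)
    proj = np.outer(e, e)
    G = c1 * np.eye(len(w)) + (c2 - c1) * proj
    G_inv = np.eye(len(w)) / c1 + (1.0 / c2 - 1.0 / c1) * proj
    return G, G_inv


def natural_gradient_step(w, x, z, c=0.05):
    """One update of the form (28) for the squared error 0.5 * (z - f(w.x))^2."""
    _, G_inv = fisher_and_inverse(w)
    f = logistic(w @ x)
    grad = -(z - f) * f * (1.0 - f) * x   # ordinary gradient of the loss
    return w - c * (G_inv @ grad)
```

Because the inverse is available in closed form, the update costs only a few inner products per example; no matrix inversion is carried out.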

The geometry of the manifold N of multilayer feedforward neural networks can be
constructed in principle in the same way. However, it is not so easy to calculate
the explicit forms of the metric $g_{ij}(w)$ and the connections $\Gamma^{(e)}_{ijk}$ and $\Gamma^{(m)}_{ijk}$. The
geometrical structure becomes very complicated because of hidden units.

Neural  network  researchers  sometimes  use  statistical  methods  such  as  the  EM 
algorithm  [15].  Information  geometry  is  useful  for  elucidating  such  a  method  [7]. 
Let $(x, y, z)$ be the random variables representing the input, outputs of hidden units
and the final output. Let S be the manifold of all the joint probability distributions
$q(x, y, z)$. The manifold N of the probability distributions $p(x, y, z; w)$ realized by
the network specified by w forms a submanifold in S.

We want to realize a stochastic input-output relation $q(z|x)$ or $q(x, z)$, and we do
not care about the values y of hidden units so far as the input-output relation is
well approximated. All the probability distributions whose marginal distributions
are the same, that is, those satisfying

$$\sum_y q(x, y, z) = q^*(x, z) \tag{31}$$

form an m-flat submanifold D in S,

$$D_{q^*} = \left\{ q(x, y, z) \;\Big|\; \sum_y q(x, y, z) = q^*(x, z) \right\}.$$

This is called the data submanifold. The problem of approximation is formulated as
follows: Given $q^*(x, z)$, obtain the neural network $w^*$ that minimizes
$D[q(x, y, z) : p(x, y, z; w)]$, that is

$$\min_{p \in N,\; q \in D_{q^*}} D[q : p]. \tag{32}$$

Hence, the problem is minimization of the divergence between two submanifolds N
and $D_{q^*}$.

This is iteratively solved as follows:

Step 1 : Given $p_i \in N$, obtain $q_i \in D_{q^*}$ that minimizes $D[q : p_i]$.

Step 2 : Given $q_i \in D_{q^*}$, obtain $p_{i+1} \in N$ that minimizes $D[q_i : p]$.

Step 1 is solved by e-projecting $p_i$ to $D_{q^*}$. This is equivalent to taking the
conditional expectation of hidden variables with respect to $p_i$. Step 2 is the
maximization of the likelihood and is given by m-projecting $q_i$ to N. Hence, the
procedure is called the em-algorithm in geometry ([12], [13], [7], [10]) and the
EM-algorithm in statistics. The algorithm has been shown to be very effective in the
mixture of expert nets model ([15], and many others). Its geometry was studied by
Amari [7]. Xu [21] extended the idea to be applicable to more general neural network
learning problems.
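The alternation of Step 1 and Step 2 can be sketched on a toy model. Everything specific here is an assumption made for illustration: the hidden variable y and the two visible variables $z_1, z_2$ are binary, there is no separate input x, and the model manifold N is a latent-class model in which $z_1$ and $z_2$ are independent given y. The e-step keeps the model's conditional $p(y|z)$ and imposes the given visible marginal; the m-step refits the parameters to the completed distribution.

```python
import numpy as np


def kl(q, p):
    """Divergence D[q : p] between strictly positive arrays of equal shape."""
    return float(np.sum(q * np.log(q / p)))


def model_joint(pi, a, b):
    """p(y, z1, z2) = pi[y] * P(z1 | y) * P(z2 | y), with a[y] = P(z1 = 1 | y)."""
    p = np.empty((2, 2, 2))
    for y in range(2):
        for z1 in range(2):
            for z2 in range(2):
                p[y, z1, z2] = (pi[y]
                                * (a[y] if z1 else 1.0 - a[y])
                                * (b[y] if z2 else 1.0 - b[y]))
    return p


def e_step(p, q_star):
    """Step 1: point of the data submanifold closest to p."""
    cond = p / p.sum(axis=0, keepdims=True)   # p(y | z1, z2)
    return cond * q_star[None, :, :]          # impose the visible marginal


def m_step(q):
    """Step 2: model parameters minimizing D[q : p] (maximum likelihood)."""
    pi = q.sum(axis=(1, 2))
    a = q[:, 1, :].sum(axis=1) / pi
    b = q[:, :, 1].sum(axis=1) / pi
    return pi, a, b


q_star = np.array([[0.40, 0.10],
                   [0.15, 0.35]])             # given visible distribution
pi, a, b = np.array([0.6, 0.4]), np.array([0.3, 0.8]), np.array([0.4, 0.7])

divergences = []
for _ in range(50):
    p = model_joint(pi, a, b)
    q = e_step(p, q_star)
    divergences.append(kl(q, p))              # D[q_i : p_i]
    pi, a, b = m_step(q)
```

Each of the two projections can only lower the divergence, so the recorded sequence `divergences` is non-increasing.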

4  Manifold  of  Boltzmann  Machines 

A Boltzmann machine is a recurrently connected neural network composed of
binary stochastic neurons. We first show a simple Boltzmann machine without
hidden units and then study that with hidden units. The behavior of a Boltzmann
machine is shown by the following stochastic dynamics: Let
$x(t) = (x_1(t), \ldots, x_n(t))$ be the state of the network at time t, where $x_i$ takes on
0 and 1, representing the quiescent and excited states of the ith neuron. At time
t + 1, choose one neuron at random. Let the ith neuron be chosen. Then, the state
of the ith neuron changes stochastically by

$$\mathrm{Prob}\{x_i(t + 1) = 1\} = \frac{\exp\{u_i\}}{1 + \exp\{u_i\}}, \tag{33}$$

where

$$u_i = \sum_j w_{ij}\,x_j(t) - w_{i0}, \tag{34}$$

$w_{ij}$ being the connection weights of the ith and jth neurons and $w_{i0}$ being the
biasing term of the ith neuron. All the other neurons are unchanged, $x_j(t + 1) = x_j(t)$.

The sequence $x(t)$, $t = 1, 2, \ldots$, is a Markov process, and its stationary distribution
is explicitly given by

$$p(x, W) = \exp\left\{\frac{1}{2} \sum_{i,j} w_{ij}\,x_i x_j - \psi(W)\right\} \tag{35}$$

when $w_{ij} = w_{ji}$, $w_{ii} = 0$ hold, where $x_0 = 1$. The set of all the Boltzmann machines
forms an $n(n + 1)/2$-dimensional manifold B, and $W = \{w_{0i}, w_{ij}, i < j\}$ is a
coordinate system of the Boltzmann neural manifold. To each Boltzmann machine W
corresponds a probability distribution (35) and vice versa. Therefore, B is identified
with the set of all the probability distributions of form (35).
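That a distribution of the form (35) is stationary under the dynamics (33) can be verified exactly for a small network by building the full transition matrix. The sketch below uses its own sign convention, stated here as an assumption: the bias enters as $u_i = \sum_j w_{ij} x_j + b_i$ and the exponent as $\frac{1}{2}\sum_{i,j} w_{ij} x_i x_j + \sum_i b_i x_i$, so that $b_i$ plays the role of the bias term with its sign absorbed. The weights and function names are arbitrary illustrative choices.

```python
import itertools
import numpy as np


def boltzmann_distribution(W, b):
    """Normalised exp{1/2 sum_ij w_ij x_i x_j + sum_i b_i x_i} over all 2^n states."""
    n = len(b)
    states = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)
    exponent = 0.5 * np.einsum("si,ij,sj->s", states, W, states) + states @ b
    p = np.exp(exponent)
    return states, p / p.sum()


def transition_matrix(W, b):
    """One step of the dynamics: pick a neuron uniformly, then resample its state."""
    n = len(b)
    states = list(itertools.product([0, 1], repeat=n))
    index = {s: k for k, s in enumerate(states)}
    P = np.zeros((len(states), len(states)))
    for s in states:
        x = np.array(s, dtype=float)
        for i in range(n):
            u = W[i] @ x + b[i]                    # w_ii = 0, so x_i itself drops out
            prob_one = 1.0 / (1.0 + np.exp(-u))    # equals exp(u) / (1 + exp(u))
            for value, prob in ((1, prob_one), (0, 1.0 - prob_one)):
                t = list(s)
                t[i] = value
                P[index[s], index[tuple(t)]] += prob / n
    return P
```

For a symmetric W with zero diagonal, multiplying the distribution by the transition matrix should return the same distribution.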

Let S be the set of all the distributions over the $2^n$ states x,

$$S = \left\{ q(x) \;\Big|\; q(x) > 0,\; \sum_x q(x) = 1 \right\}. \tag{36}$$

Then, B is an $n(n + 1)/2$-dimensional submanifold embedded in the
$(2^n - 1)$-dimensional manifold S. We can show that B is a flat submanifold of S, and that
both S and B are dually flat, although they are curved from the Riemannian metric
point of view [10].

Given a distribution $q(x)$, we train the Boltzmann machine by modifying $w_{ij}$ based
on independent examples $x_1, x_2, \ldots$ from the distribution q, for the purpose of
obtaining W such that $p(x, W)$ approximates $q(x)$ as well as possible. The degree
of approximation is measured by the divergence $D[q(x) : p(x, W)]$. The best
approximation is given by m-projecting q to B. We have an explicit solution since
the manifolds are flat. The stochastic gradient learning was proposed by Ackley,
Hinton and Sejnowski [1],

$$W_{t+1} = W_t - \eta\,\frac{\partial}{\partial W} D[q : p], \tag{37}$$

where the gradient term is not in the expectation form but is evaluated by the random
example $x_t$. However, from the geometrical point of view, it is more natural to use

$$W_{t+1} = W_t - \eta\,G^{-1}\,\frac{\partial}{\partial W} D[q : p], \tag{38}$$

where G is the Fisher information matrix. In this case, the expectation of the
trajectory $W_t$ of learning is the e-geodesic of B (when the continuous time version is
used), see [10]. It is shown that the convergence is much faster in this case, although
calculations of $G^{-1}$ are sometimes not easy.

When hidden units exist, we divide x into

$$x = (x^V, x^H), \tag{39}$$

where $x^V$ represents the state of the visible part and $x^H$ represents that of the
hidden part. The connection weights are also divided naturally into the three parts
$W = (W^V, W^H, W^{VH})$. The probability distribution of the visible part is given by

$$p(x^V, W) = \sum_{x^H} p(x^V, x^H; W). \tag{40}$$

Let $S^V$ and $B^V$ be the manifolds of all the distributions over $x^V$ and those realizable
by Boltzmann machines in the form of (40). Then, $S^V$ is flat but $B^V$ is not.

It is easier to treat the extended manifolds $S^{V,H}$ and $B^{V,H}$ including hidden units.
However, only a distribution $q(x^V)$ is specified from the outside in the learning
process and its hidden part is not. Let us consider a submanifold

$$D = \left\{ q(x^V, x^H) \;\Big|\; \sum_{x^H} q(x^V, x^H) = q(x^V) \right\}. \tag{41}$$

The submanifold D is m-flat. Because we want to approximate $q(x^V)$ but we do
not care about $q(x^H)$ nor the interaction of $x^V$ and $x^H$, the problem reduces to
minimizing

$$D[q(x^V, x^H) : p(x^V, x^H; W)]$$

over all W and $q \in D$, or

$$D[q : p]$$

over all $p \in B^{V,H}$ and $q \in D$. This is the minimization of the divergence between
two submanifolds D and $B^{V,H}$.

We can use the geometric em algorithm, which is equivalent to the statistical EM
algorithm, in this case too. Since D is m-flat and $B^{V,H}$ is e-flat, the problem is
relatively easy to solve (see [7]).

5 Conclusion

We have shown the framework of the information-geometrical method of neural
networks. Information geometry originated from research on statistical manifolds,
or families of probability distributions. It proposes a new geometrical notion
of the duality of affine connections. It has successfully been applied to statistics,
information theory, control systems theory and many other fields. Neural network
research is one of the promising fields of application of information geometry. This
research has just started, with the expectation that it will open new perspectives
on neural networks.

In the past decade, research in neurocomputing has been divided into two relatively well-defined
tracks: one track dealing with cognition and the other with behavior. Cognition deals with
organizing, classifying and recognizing sensory stimuli. Behavior is more dynamic, involving sequences
of actions and changing interactions with an external environment. The mathematical techniques
that apply to these areas, at least from the point of view of neurocomputing, appear to have been
quite separate as well. The purpose of this paper is to give an overview of some recent powerful
mathematical results in behavioral neurocomputing, specifically the concept of Q-learning due to C.
Watkins, and some new extensions. Finally, we propose ways in which the mathematics of cognition
and the mathematics of behavior can move closer to build more unified systems of information
processing and action.

1  Introduction 

The  study  of  artificial  neural  networks  has  burgeoned  in  the  past  decade.  Two 
distinct  lines  of  research  have  emerged:  the  cognitive  and  the  behavioral.  Cognitive 
research  deals  with  the  biological  phenomenon  of  recognition,  the  mathematics  of 
pattern  analysis  and  statistics,  and  applications  in  automatic  pattern  recognition. 
Behavioral  research  deals  with  the  biological  phenomena  of  planning  and  action,  the 
mathematics of time dependent processes, and applications in control and
decision-making.

To be mathematically precise, let us discuss simple formulations of each problem
type. A cognitive problem typically involves so-called feature vectors, $x \in \mathbb{R}^n$. These
feature vectors are sensory stimuli or measurements and are presented to us in some
random way; that is, we cannot predict with certainty which stimuli will occur
next. These observations must be classified and the classification is accomplished
by a function $f : \mathbb{R}^n \to \mathbb{R}^m$. The problem is to build an estimate of the true
function, f, based on a sample set of the form $S = \{(x_i, y_i)\}$, whose elements
are drawn according to a joint probability distribution on $(x, y) \in \mathbb{R}^{n+m}$, say
$\mu(x, y)$. Call the estimate $g(x)$. Typically, $f(x)$ is the conditional expected value
of y: $f(x) = \int_y y\,d\mu(x, y) \,/ \int_y d\mu(x, y)$. A number of distinct situations in cognitive
neurocomputing arise depending on the type of information available for estimating
f. (See one of the numerous excellent textbooks on neural computing and machine
learning for more details.)

■ Supervised Learning - The sample data, S, is as above so that correct
classifications, apart from noise, are part of the data. Moreover, an error criterion
is typically provided or constructed on the classifications: $|g(x) - f(x)|$ is some
error metric. This situation is closest to traditional statistical regression and
uses many classical statistical methods.

■ Unsupervised Learning - No correct or approximate classification values,
y, are given. The problem is to organize the data into equivalence classes or
clusters and then to label the classes. These labels serve to define f from above
and the performance criterion is related to how accurately the equivalence
classes are formed. Unsupervised learning is closely related to clustering and
uses techniques from that area.

■ Reinforcement Learning - As in unsupervised learning, no “correct”
response is given but a reinforcement is available. A reinforcement can be thought
of as an error when $g(x)$ is the estimated response for feature vector x but the
correct response f is not available. Reinforcement learning is thus between
supervised and unsupervised learning: some performance feedback is provided
but not in the form of the correct answer and the deviation of the response
from it. The difference between supervised and reinforcement learning has been
characterized as the difference between learning from a teacher and learning
from a critic. The teacher provides the correct response but the critic only
states how bad the system’s response was. Reinforcements are often referred
to as “rewards” and “punishments” depending on their sizes and the context.

Behavioral  neurocomputing  derives  its  problems  from  planning,  decision-making 
and  other  applications  in  which  actions  over  time  are  of  fundamental  interest.  The 
mathematical framework, which will be formalized below, involves a system with
states. Actions are available to move the system into another state, perhaps
stochastically. There is an immediate cost associated with taking an action. An action at
any given time may be optimal in the short run but may not be the best over
the long run when future costs and actions are taken into account. Behavioral
applications are therefore most naturally cast as reinforcement learning problems.
The  goal  of  behavioral  learning  is  to  have  a  system  infer,  from  empirical  data,  the 
state-action  pairs  which  give  the  smallest  average  cost. 

There  are  a  number  of  ways  that  such  a  behavioral  system  could  be  reduced  to  a 
cognitive  problem.  One  involves  building  the  following  type  of  training  set.  Given 
some  initial  state,  generate  a  sequence  of  actions  randomly,  stopping  at  some  future 
time.  Record  the  initial  state,  the  action  sequence  and  the  ultimate  cost  of  taking 
that  action  sequence  from  that  initial  state.  From  this  data,  build  a  classifier  whose 
input  feature  vector  is  a  state  and  whose  output  is  the  action  sequence  which  has 
minimal  cost  over  all  sequences  tried.  This  estimate  of  the  minimal  cost  to  solve 
the  problem  from  an  initial  state  is  an  estimate  of  the  cost-to-go  function. 

A related way to solve the problem involves estimating these cost-to-go values for
each state-action pair. Notice that in a stochastic setting, the action which
empirically resulted in the least cost solution may not be the action which leads to
the least cost expected value. So averaging is more appropriate than merely
keeping track of the minimum cost solution for a given action sequence. The artificial
intelligence community has had considerable difficulties with behavioral learning
because of difficulties with temporal delays and combinatorial explosions
resulting from brute force approaches. Only recently has the mathematics of controlled
Markov chains been explored along with solution techniques such as dynamic
programming. A major breakthrough in learning optimal actions for such processes
has been Watkins’ Q-learning [3, 2, 4, 1].

In this paper, we give a quick overview of controlled Markov processes in Section
2. Section 3 presents Watkins’ basic results and Section 4 extends those results to
describe a system that learns and simultaneously converges to the optimal policy
solution. Section 5 discusses possible areas of intersection between the cognitive
and behavioral problems.

                       Cognitive Learning        Behavioral Learning

Goal                   Classifying inputs        Optimizing actions

Related Areas          Statistics                Control theory
                       Pattern recognition       Operations research

Relevant Mathematics   Approximation theory      Dynamic programming
                       Probability               Stochastic systems

Status                 Sound theoretical         Emerging analytic
                       foundations               theory

Table 1  Cognitive vs. Behavioral Learning.

2  Markov  Decision  Processes 

The paradigm for behavioral neurocomputing can typically be cast as a controlled
Markov process which is described below. An environment has a number of states,
$X = \{x_i\}$. For each state, there is a set of actions, $A_i = \{a_{ij}\}$, each of which
transforms the current state to another state stochastically. This is quantified as a
collection of transition probabilities: $p(x_i, a_{ij}, x_k)$ being the probability that applying
action $a_{ij}$ in state $x_i$ we land in state $x_k$ at the next time step. These
probabilities are independent of time and previous states, hence the stationary Markov
character.

Each action engenders a cost, $k(x_i, a_{ij})$, which depends only on the state and the
action. The goal of a behavioral problem is to minimize this cost over time. Such
a temporal cost depends on a policy which specifies the actions to be taken at
various times in various states. Thus for each time, $t$, in a policy, we have a vector
of actions $a_t$ which tells us which action to apply for any of the possible states the
system could be in at time $t$. Specifically, $a_t(x_i)$ is one of the allowable actions $a_{ij}$
for state $x_i$. There are a number of ways in which to quantify this temporal cost
minimization:

■ Free-time minimization - A designated state, $x_0$, terminates the process.
We are concerned with the cost of the behavior up to the time this state is
reached. Starting in some state $x_i$ and following a policy, $\pi = (a_0, a_1, a_2, \ldots)$,
suppose we follow the trajectory $x_i, x_{i_1}, x_{i_2}, \ldots, x_{i_{T-1}}, x_0$; the cost will then be

$$\sum_{j=0}^{T-1} k(x_{i_j}, a_j(x_{i_j})),$$

where $i_0 = i$. However, since the transitions are stochastic, we must take an
expectation over all possible trajectories, starting with $x_i$ and terminating in $x_0$,
with probabilities determined by the policy, $\pi$, and the corresponding transitions:

$$V(\pi, i) = E\Big\{\sum_{j=0}^{T-1} k(x_{i_j}, a_j(x_{i_j}))\Big\}.$$

■ Fixed-time minimization - As above but the cost is computed up to a fixed
time as opposed to when a terminal state is reached.

■ Average-cost minimization - The average future cost is minimized (here
$0 < \gamma < 1$ is the discount factor):

$$V(\pi, i) = \lim_{N \to \infty} \frac{1}{N} E\Big\{\sum_{j=0}^{N-1} k(x_{i_j}, a_j(x_{i_j}))\Big\}.$$

■ Discounted cost minimization - Here the costs of future actions are
discounted by a constant rate, $0 < \gamma < 1$, so that

$$V(\pi, i) = E\Big\{\sum_{j=0}^{\infty} \gamma^j k(x_{i_j}, a_j(x_{i_j}))\Big\}.$$
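
As a small illustration of the discounted criterion, the sketch below computes the discounted cost of one realized trajectory. The function name `discounted_cost` and the list-of-costs input are assumptions made here for illustration; the cost-to-go itself is the expectation of this quantity over trajectories.

```python
def discounted_cost(costs, gamma):
    """Return sum_j gamma**j * costs[j] for one realized trajectory.

    costs -- immediate costs observed at steps j = 0, 1, 2, ... (hypothetical input)
    gamma -- discount factor, 0 < gamma < 1
    """
    total = 0.0
    # Accumulate from the last step backwards (Horner's scheme),
    # which avoids computing powers of gamma explicitly.
    for cost in reversed(costs):
        total = cost + gamma * total
    return total
```

Averaging this value over many trajectories that start in the same state would give an empirical estimate of the cost-to-go for that state under the policy that generated them.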

We  consider  only  discounted  cost  minimization  problems  here.  A  large  body  of 
literature  in  optimal  control  theory  has  been  devoted  to  solving  such  problems  and 
we  review  the  key  elements,  leaving  out  the  occasional  technical  requirements  for 
the  results  to  hold.  We  focus  instead  on  the  general  flow  of  ideas. 

First  of  all,  we  need  only  consider  policies  which  are  independent  of  time  if  we  are 
interested  in  the  optimal  policy.  The  reason  is  that  the  process  is  Markovian  and 
once  we  reach  a  state  at  a  point  in  time,  only  the  future  costs  can  be  affected.  Since 
the  costs  are  additive  (in  a  discounted  way),  an  optimal  choice  of  action  depends 
only  on  the  current  state.  This  simplification  allows  us  to  consider  only  policies  that 
are of the form $\pi = (a, a, a, \ldots)$ where $a$ is a mapping from states to actions.
Secondly,  the  cost-to-go  functions,  which  we  now  write  as  V(a,i)  because  we  are 
restricting  ourselves  to  stationary  policies,  satisfy  a  recurrence  relation  of  the  form: 

$$V(a, i) = k(x_i, a(x_i)) + \gamma \sum_j p(x_i, a(x_i), x_j) V(a, j).$$

This identity has the following interpretation: using action $a(x_i)$ we go to state
$x_j$ with probability $p(x_i, a(x_i), x_j)$; once in state $x_j$ we have the cost-to-go value
$V(a, j)$. Weighing these cost-to-go values with the corresponding transition
probabilities and adding gives the identity. Stacking these values to form a vector $V(a)$,
the above equation becomes a vector-matrix equation of the form:

$$V(a) = k(a) + \gamma P(a) V(a).$$

Thus, every stationary policy satisfies an equation of the above form which can be
solved by rewriting it as:

$$(I - \gamma P(a)) V(a) = k(a)$$

where it is assumed that $P(a)$ and $k(a)$ are known for a given policy $a$. Thus this
is a simple linear system of equations which can be solved using standard matrix
iterative techniques.
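
One such iterative technique can be sketched as follows: repeatedly apply the map V -> k + gamma P V, which is a contraction for 0 < gamma < 1. This is only a minimal sketch; the function name `evaluate_policy`, the nested-list representation of P, and the stopping tolerance are assumptions, not part of the original text.

```python
def evaluate_policy(P, k, gamma, tol=1e-10, max_iter=10_000):
    """Approximate the solution of V = k + gamma * P V by fixed-point iteration.

    P[i][j] -- probability of moving from state i to state j under the fixed policy
    k[i]    -- immediate cost of the policy's action in state i
    """
    n = len(k)
    V = [0.0] * n
    for _ in range(max_iter):
        V_new = [k[i] + gamma * sum(P[i][j] * V[j] for j in range(n))
                 for i in range(n)]
        # Stop once successive iterates agree to within the tolerance.
        if max(abs(a - b) for a, b in zip(V_new, V)) < tol:
            return V_new
        V = V_new
    return V
```

For example, with two states that deterministically swap, `P = [[0, 1], [1, 0]]`, costs `k = [1, 0]` and `gamma = 0.5`, the exact solution of the linear system is V = (4/3, 2/3), and the iteration approaches it geometrically.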

What this shows is that, for any fixed action policy, there is a computational
procedure for computing its cost-to-go function. What remains is to find
an action policy which will lead to the overall optimal strategy - one that minimizes
the cost-to-go function for each state.

Specifically, the optimal policy $a^*$ has the property that

$$V(a^*) = V^* = \min_a \{k(a) + \gamma P(a) V^*\}.$$

This policy can be computed using the iteration, starting with $V_0$ arbitrary,

$$V_{n+1} = \min_a \{k(a) + \gamma P(a) V_n\}.$$

Then $V_n \to V^*$ and the optimal policy is the policy which realizes the minimization
above.
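
The iteration above can be sketched in a few lines. The data layout (`P[i][a][j]` for transition probabilities, `k[i][a]` for costs), the function name `value_iteration`, and the stopping rule are assumptions chosen for illustration, and the minimization is taken state by state.

```python
def value_iteration(P, k, gamma, tol=1e-10, max_iter=10_000):
    """Iterate V_{n+1}(i) = min_a { k[i][a] + gamma * sum_j P[i][a][j] * V_n(j) }.

    Returns the approximate optimal cost-to-go and a policy realizing the minimum.
    """
    n = len(k)
    V = [0.0] * n

    def backup(i, a, values):
        # One-step cost of action a in state i plus discounted expected future cost.
        return k[i][a] + gamma * sum(P[i][a][j] * values[j] for j in range(n))

    for _ in range(max_iter):
        V_new = [min(backup(i, a, V) for a in range(len(k[i]))) for i in range(n)]
        converged = max(abs(x - y) for x, y in zip(V_new, V)) < tol
        V = V_new
        if converged:
            break

    policy = [min(range(len(k[i])), key=lambda a: backup(i, a, V)) for i in range(n)]
    return V, policy
```

In a two-state example where state 0 can stay (cost 2) or move to state 1 (cost 1), and state 1 can stay (cost 0) or move back (cost 3), the sketch returns V = (1, 0) with the policy "move" in state 0 and "stay" in state 1.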

3  Watkins’  Q-Learning 

The  behavioral  learning  problem  is  to  learn  the  optimal  (minimal  cost)  policy  using 
samples  of  the  following  form: 

$$s_i = (\eta_i, \zeta_i, \alpha_i, k_i), \quad i = 1, 2, 3, \ldots$$

These samples are four-tuples with the interpretation that $s_i$ involves a trial use
of action $\alpha_i$ on state $\eta_i$ which resulted in a transition to state $\zeta_i$ at a cost of $k_i$.
Given  many  samples  with  the  same  initial  state  and  action,  we  can  estimate  the 
transition  probabilities  as  well  as  the  expected  cost  of  taking  that  action  from  that 
state.  With  such  estimates,  a  standard  solution  procedure  for  computing  optimal 
policies  for  a  Markov  decision  process  can  be  carried  out.  This  is  an  off-line,  batch 
approach. 
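
The off-line, batch estimates described above might be computed as in the following sketch. The tuple order (state, next state, action, cost), the function name `estimate_model`, and the dictionary-based output are assumptions made for illustration.

```python
from collections import defaultdict


def estimate_model(samples):
    """Estimate transition probabilities and expected costs from four-tuples.

    samples -- iterable of (state, next_state, action, cost)
    Returns (p_hat, k_hat) keyed by (state, action).
    """
    counts = defaultdict(lambda: defaultdict(int))
    cost_sum = defaultdict(float)
    visits = defaultdict(int)
    for state, next_state, action, cost in samples:
        counts[(state, action)][next_state] += 1
        cost_sum[(state, action)] += cost
        visits[(state, action)] += 1
    # Relative frequencies estimate the transition probabilities.
    p_hat = {sa: {nxt: c / visits[sa] for nxt, c in nxts.items()}
             for sa, nxts in counts.items()}
    # Sample means estimate the expected immediate costs.
    k_hat = {sa: cost_sum[sa] / visits[sa] for sa in visits}
    return p_hat, k_hat
```

The resulting estimates could then be passed to a standard solution procedure such as the dynamic programming iteration of Section 2.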

Is  there  an  on-line  approach  which  will  learn  the  optimal  strategy  adaptively,  with 
guaranteed convergence? The on-line strategy does not compute transition probabilities
or costs explicitly. Watkins has given an elegant solution to this problem
which  he  named  Q-learning.  Q-learning  uses  samples  of  the  above  form  to  update 
a  simple  table  whose  entries  converge  and  whose  limit  values  can  easily  be  used  to 
infer  an  optimal  policy. 

Q-learning is motivated by so-called Q-values which are defined as follows:

$$Q^a(x_i, a_{ij}) = k(x_i, a_{ij}) + \gamma \sum_m p(x_i, a_{ij}, x_m) V(a, m).$$

$Q^a(x_i, a_{ij})$ is the expected cost when starting in state $x_i$, performing action $a_{ij}$, and
then following policy $a$ thereafter. Let $Q^*$ be the Q-values for the optimal policy
$a^*$, meaning we take some initial action and then follow it with optimal actions
thereafter. For these Q-values, we have

$$V(a^*)_i = \min_{a_{ij}} \{Q^*(x_i, a_{ij})\}$$

and the optimal action from state $x_i$ is precisely the action which achieves the
minimum.

The beauty of Watkins’ Q-learning is that we can adaptively estimate the Q-values
for the optimal policy using samples of the above type. The Q-learning algorithm
begins with a tableau, $Q_0(x_i, a_{ij})$, initialized arbitrarily. Using samples of the form

$$s_i = (\eta_i, \zeta_i, \alpha_i, k_i), \quad i = 1, 2, 3, \ldots$$

we perform the following update on the Q-value tableau:

$$Q_i(\eta_i, \alpha_i) = (1 - \beta_i) Q_{i-1}(\eta_i, \alpha_i) + \beta_i (k_i + \gamma V(\zeta_i))$$

where $V(\zeta_i) = \min_a \{Q_{i-1}(\zeta_i, a)\}$. All other elements of $Q_i$ are merely copied from
$Q_{i-1}$ without change. The parameters $\beta_i \to 0$ as $i \to \infty$.
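
A single step of this update can be sketched as below. The dictionary representation of the tableau, the `actions` lookup, and the function name `q_update` are illustrative assumptions; the sample is taken in the order (state, next state, action, cost) and the step size is supplied by the caller.

```python
def q_update(Q, actions, sample, beta, gamma):
    """Apply one Q-learning update in place and return the tableau.

    Q       -- dict mapping (state, action) to the current cost estimate
    actions -- dict mapping each state to its list of allowable actions
    sample  -- (state, next_state, action, cost) four-tuple
    beta    -- step size for this update; gamma -- discount factor
    """
    state, next_state, action, cost = sample
    # Estimated cost-to-go of the next state: the smallest Q-value there.
    v_next = min(Q[(next_state, a)] for a in actions[next_state])
    # Blend the old entry with the new one-step estimate; other entries are untouched.
    Q[(state, action)] = ((1 - beta) * Q[(state, action)]
                          + beta * (cost + gamma * v_next))
    return Q
```

Because costs are minimized here, the next-state value uses `min` rather than the `max` that appears in reward-based formulations.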

Theorem 1 (Watkins [6, 7, 5]) - Let $\{i(x, a)\}$ be the set of indices for which the
$(x, a)$ entry of the Q-tableau is updated. If

$$\sum_{i(x,a)} \beta_{i(x,a)} = \infty \quad \text{and} \quad \sum_{i(x,a)} \beta_{i(x,a)}^2 < \infty$$

then $Q_i(x, a) \to Q^*(x, a)$ as $i \to \infty$. Accordingly, $a_i(x) = \arg\min_a Q_i(x, a)$
converges to the optimal action for state $x$.

Proof See [7, 5]. □

This  result  is  remarkable  in  that  it  demonstrates  that  a  simple  update  rule  on  the 
Q-tableau  results  in  a  learning  system  which  computes  the  optimal  policy.  In  the 
next  section  we  show  how  this  method  can  be  embedded  into  an  online  system  which 
simultaneously  uses  the  current  Q-values  to  generate  policies  and  uses  the  results 
to update Q-values in such a way as to satisfy Watkins’ theorem. Consequently, the
online  system  learns  the  optimal  policy  to  which  it  ultimately  converges. 

4  Universal  On-line  Q-Learning 

Online  optimal  learning  and  asymptotic  optimal  performance  must  be  a  compromise 
between  performing  what  the  learning  system  estimates  to  be  the  optimal  actions 
and  persistently  exciting  all  possible  state-action  possibilities  often  enough  to  satisfy 
the  frequencies  stipulated  in  Watkins’  Q-learning  theorem. 

To begin, let us introduce a notion of asymptotically optimal performance. Suppose
we have an infinite sequence of state-action pairs that result from an actual
realization of the Markov decision process, say $\{(y_i, \alpha_i), i = 0, 1, \ldots\}$ with $(y_i, \alpha_i)$ meaning
that at time $i$ we are in state $y_i$ and execute action $\alpha_i$, resulting in a transition to
state $y_{i+1}$ at the next time. For such a sequence, we have a corresponding sequence
of costs $V_i$ which represent the actual costs computed from the realized sequence,
$V_i$ being the cost of following the sequence from time $i$ onwards.

We now define asymptotic average optimality. Suppose that $V^*_i$ is the optimal expected
cost-to-go for state $i$. Let $V_i(n_{ij})$ be the observed cost-to-go values for state $i$ when
the system is in state $i$ at time $n_{ij}$. Let $N_M(i)$ be the number of times the process
has visited state $i$ up to time $M$. Then we say that the policy is asymptotically optimal
on average for state $i$ if

$$\frac{\sum_{n_{ij} < M} V_i(n_{ij})}{N_M(i)} \to V^*_i$$

as $M \to \infty$.

In order to establish this result, we need to construct an auxiliary Markov process
with states $(x_i, a_{ij})$ and transition probabilities,

$$p((x_i, a_{ij}), (x_m, a_{mn})) = p(x_i, a_{ij}, x_m) / J_m$$

where $J_m$ is the number of allowable actions in state $x_m$. This auxiliary process
has the following interpretation: the transition probabilities between states are
determined by the transition probabilities of the original controlled process with the
added ingredient that the actions corresponding to the resulting state are uniformly
randomized.
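
The construction of the auxiliary process can be sketched directly from its definition. The dictionary layout and the function name `auxiliary_transitions` are assumptions for illustration only.

```python
def auxiliary_transitions(p, actions):
    """Build the auxiliary chain on (state, action) pairs.

    p       -- dict mapping (state, action) to {next_state: probability}
    actions -- dict mapping each state to its list of allowable actions
    Returns a dict mapping (state, action) to {(next_state, next_action): probability}.
    """
    aux = {}
    for (x, a), row in p.items():
        aux[(x, a)] = {}
        for x_next, prob in row.items():
            n_actions = len(actions[x_next])
            # The action paired with the resulting state is chosen uniformly.
            for a_next in actions[x_next]:
                aux[(x, a)][(x_next, a_next)] = prob / n_actions
    return aux
```

Each row of the auxiliary chain still sums to one, since the probability of reaching a state is simply split evenly among that state's allowable actions.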


Theorem  2  -  Suppose  that  there  is  at  least  one  stationary  policy  under  which  the 
states  of  the  resulting  Markov  process  constitute  a  single  recurrent  class  (that  is,  the 
process is ergodic). Then, with probability 1, there is a mixed nonstationary strategy
for  controlling  the  Markov  decision  process  with  the  following  two  properties: 

1.  The  optimal  policy  is  estimated  by  Q-learning  asymptotically; 

2.  The  control  is  asymptotically  optimal  on  average  for  all  states  that  have  a  nonzero 
stationary  occupancy  probability  under  the  process’  optimal  policy. 

Proof - The proof consists of three parts. We begin by showing that under the
hypothesis that the original process has a stationary policy under which the Markov
process consists of a single recurrent class, the corresponding auxiliary process also
has a single recurrent class. Let $a(x)$ be the action for a state $x$ which makes the
original process have a single recurrent set. Denote the corresponding transition
probabilities by $p^*(x, x') = p(x, a(x), x')$. Let $D$ be the maximal number of actions
per state: $D \ge |\{a_{ij}\}|$ for all $i$. Consider the auxiliary process’ canonical partitioning
into subclasses of states in which each set of the partition consists of states which
are either a) transient or b) recurrent and communicate with each other. Now the
$(x, a(x))$ form a subset of the states of the auxiliary process.

Figure 1  Auxiliary Process Schematic. (Transition within a row is uniformly randomized.)

Step 1. The Auxiliary Process is Recurrent - We will show that any given state of
the auxiliary process communicates with every other state. Let $(x, a)$ and $(x', a')$
be any two states of the auxiliary process. We must show that

$$p^{(n)}((x, a), (x', a')) > 0$$

for some $n$. Pick any state $y$ for which $p(x, a, y) > 0$. By assumption, there is an $m$
for which

$$p^{*(m)}(y, x') > 0$$

under the special choice of actions, $a(x)$. This means that there is a sequence of
states $y_1 = y, y_2, \ldots, y_m = x'$ for which

$$p(y_i, a(y_i), y_{i+1}) > 0, \quad i = 1, \ldots, m-1,$$

and so

$$p^{(m)}((x, a), (x', a')) \ge \frac{p(x, a, y)}{D} \prod_{i=1}^{m-1} [p(y_i, a(y_i), y_{i+1}) / D] > 0.$$


Since  all  states  of  the  auxiliary  process  communicate  with  each  other,  all  states 
must  be  recurrent.  Thus  the  auxiliary  process  consists  of  a  single  recurrent  set  of 
states.  Because  the  auxiliary  process  consists  of  recurrent  states,  the  expected  time 
to  visit  all  states  of  the  auxiliary  process  is  finite  and  all  states  will  be  visited  in  a 
finite  time  with  probability  1. 

Step 2. A Mixed Policy - We next define a mixed time-dependent strategy for
switching between the auxiliary process and a stationary policy based on an estimate
of the optimal policy. The mixed strategy begins by choosing actions according
to the auxiliary process. At time $n_k$ we switch from the auxiliary process to the
process controlled by the policy determined by the Q-table according to

$$a_j(n_k) = \arg\min_a Q_{n_k}(x_j, a).$$

At time $m_k$ we switch back to the auxiliary process. For $k = 1, 2, 3, \ldots$ we have
$1 = m_0 < n_k < m_k < n_{k+1} < m_{k+1} < \cdots < \infty$. Let $N'$ be the largest expected time to
visit all states of the auxiliary process starting from any state and let $N = 2 * N'$.

Consider the following experiment which is relevant to the discussion below. Start in
any state of the auxiliary process and follow it for $N$ steps at which time we jump to
an arbitrary state and run for another $N$ steps, repeating this process indefinitely.
Call each such run of $N$ steps a frame, numbering them $F_i$ for $i = 1, 2, 3, \ldots$. We want
to concatenate the frames to produce a realization of the auxiliary process. To do
this, consider the final state of frame $F_i$. Since $N$ was chosen to be twice as large as
the expected time to visit all states starting from any state, with probability 1 some
future frame, say $F_{i+j}$, includes a visit to the final state of $F_i$ in the first $N/2 = N'$
states of that frame. Concatenate to the end of $F_i$ the history in $F_{i+j}$ following
the visit to the final state of $F_i$. Each step of this concatenation adds at least $N'$
consecutive steps of the auxiliary process so there are an infinite number of visits to
all states in the concatenation with probability 1. In the following construction, we
implicitly use this property of a concatenation of frames.

The definition of $n_k$ is as follows. At time $m_{k-1}$, we began following the auxiliary
process which is recurrent. We store state transitions of the form

$$(\eta, \zeta, \alpha, k),$$

where $\eta$ is a state $x_i$, $\alpha$ is an action taken by the auxiliary process, $\zeta$ is the state of
the original process to which the process transitioned and $k$ is the observed cost of
that transition, in a list. Proceed in this manner until either $N$ steps have been taken
in this way or until the list contains a sample for each element in the Q-table. If the
list to update the Q-table is completed first, we update the Q-table and compute
the current optimal actions. In this case, $n_k$ is defined as the time after which the
Q-table was updated. Otherwise, define $n_k = m_{k-1} + N$ and continue using the
previous estimated optimal actions as a policy. In either case, $m_k = n_k + k * N$ so
we operate the process using the estimated optimal policy for increasingly longer
periods of time $k * N$. We then use this list to update the Q-table with the samples
in the list.

Figure 2  The Mixed Policy Transitions.
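
The switching times of the mixed strategy can be traced with a short sketch. Here `fill_times` is a hypothetical input giving, for each round, how many auxiliary steps were needed to collect a sample for every Q-table entry; the function name `switching_times` is likewise an assumption. The sketch only reproduces the timing rule (explore for at most N steps, then follow the estimated optimal policy for k * N steps), not the process itself.

```python
def switching_times(N, fill_times):
    """Return the list of (n_k, m_k) switching times for rounds k = 1, 2, ...

    N          -- cap on the number of auxiliary (exploration) steps per round
    fill_times -- fill_times[k-1] is the number of auxiliary steps round k
                  needed to fill the sample list (hypothetical input)
    """
    m_prev = 1  # m_0 = 1: the strategy starts in the auxiliary process
    schedule = []
    for k, fill in enumerate(fill_times, start=1):
        # Leave the auxiliary process when the list is full or after N steps.
        n_k = m_prev + min(fill, N)
        # Follow the estimated optimal policy for k * N steps.
        m_k = n_k + k * N
        schedule.append((n_k, m_k))
        m_prev = m_k
    return schedule
```

With `N = 10` and `fill_times = [4, 25, 7]`, the second round hits the cap of 10 auxiliary steps, while the periods spent under the estimated optimal policy grow as 10, 20 and 30 steps.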

Assume that the Q-table has $M$ entries. Then for updating the $i$th element for the
$j$th time in the Q-table from the sample list, use

$$\beta = \frac{1}{j * M + i}.$$

According to this scheme, element $i$ is updated for $j = 1, 2, 3, \ldots$ and the
corresponding subset of $\beta$’s forms an arithmetic subsequence of the harmonic sequence;
hence it satisfies Watkins’ criteria for divergence and squared-convergence.
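
Both criteria can be checked directly by comparison with the harmonic series, assuming the Q-table entries are indexed $1 \le i \le M$ (an indexing convention assumed here so that $jM + i \le (j+1)M$):

```latex
\sum_{j=1}^{\infty} \frac{1}{jM + i}
  \;\ge\; \sum_{j=1}^{\infty} \frac{1}{(j+1)M}
  \;=\; \frac{1}{M} \sum_{j=2}^{\infty} \frac{1}{j} \;=\; \infty,
\qquad
\sum_{j=1}^{\infty} \frac{1}{(jM + i)^2}
  \;\le\; \frac{1}{M^2} \sum_{j=1}^{\infty} \frac{1}{j^2} \;<\; \infty .
```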

By  the  discussion  above  about  frames,  this  list  is  filled  in  a  finite  amount  of  time 
infinitely  often  with  probability  1,  but  we  impose  a  deterministic  limit  on  the  time 
spent  in  this  mode  generating  a  frame.  By  construction  and  Watkins’  Theorem, 
the  Q-table  values  converge  to  the  optimal  Q-values  and  hence  the  optimal  action 
policy  is  eventually  found. 

Step 3. Asymptotic Optimal Convergence - To prove the asymptotic optimal
convergence, note that the Q-table updates are performed according to Watkins’ criteria
so that in the limit the Q-table values determine an optimal stationary policy.
Moreover, the time spent operating in the auxiliary mode becomes an asymptotically
small fraction of the time spent operating in the estimated optimal policy. Hence,
the asymptotic convergence to optimality.
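
A rough count makes the vanishing fraction explicit. Under the schedule of Step 2, round $k$ spends at most $N$ steps in the auxiliary mode and exactly $k * N$ steps under the estimated optimal policy, so after $K$ rounds (ignoring the single initial time step):

```latex
\frac{\text{auxiliary steps}}{\text{total steps}}
  \;\le\; \frac{KN}{KN + N \sum_{k=1}^{K} k}
  \;=\; \frac{KN}{KN + N K (K+1)/2}
  \;=\; \frac{2}{K + 3}
  \;\longrightarrow\; 0 \quad \text{as } K \to \infty .
```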

To  make  this  formal,  note  that  asymptotic  average  optimality  depends  only  on  what 
happens  in  the  limit.  Specifically,  if  we  can  show  that  the  average  of  the  observed 
cost-to-go  values  after  some  fixed  time  converges  to  the  optimal  value  then  we  are 
done. 

To  see  this,  run  the  process  long  enough  so  that  the  estimated  optimal  policy 
determined  by  the  Q-table  is  as  close  as  desired  to  the  optimal  cost-to-go  for  the 
process.  This  will  happen  at  some  finite  time  with  probability  1  although  the  time 
is  not  known  a  priori.  Since  we  can  also  wait  until  k  is  arbitrarily  large,  the  fraction 
of time spent in the approximately optimal mode can be made as close to 1 as we
like.  Now  for  a  state  that  has  nonzero  stationary  occupancy  probability  under  the 
optimal  policy,  the  fraction  of  time  spent  operating  under  the  estimated  optimal 
policy  approaches  1  and  so  the  empirical  cost-to-go  also  approaches  the  optimal.  For 
states  that  have  zero  occupancy  probability  under  the  optimal  policy,  the  empirical 
cost-to-go  will  be  dominated  by  the  empirical  values  determined  during  the  auxiliary 
process  operation  which  will  not  be  optimal  in  general.  Thus  the  average  within 
the  mixed  mode  of  operation  is  guaranteed  to  converge  to  the  estimated  cost-to-go 
for  the  optimal  policy  but  only  for  states  that  have  nonzero  stationary  occupancy 
probabilities under the optimal policy. □

5  Discussion 

We  have  shown  that  there  exists  a  strategy  for  operating  a  Markov  Decision  Process 
(under  simple  conditions)  in  such  a  way  that  the  optimal  strategy  is  both  learned 
and  the  asymptotic  operation  of  the  system  approaches  optimality  for  states  that 
have  nonzero  occupancy  probabilities  under  the  computed  optimal  policy.  This  is 
only  one  of  many  possible  strategies  for  performing  both  of  these  simultaneously. 
The question of which operating procedures are best for fastest convergence to the
optimal cost-to-go values is beyond the scope of the techniques that we use. It is
an important question to pursue in the future.

Figure 3  A Unified Learning Theory.

One  of  the  weaknesses  of  the  Q-Learning  framework  is  that  states  and  actions  for  the 
Q-Tableaux  must  be  known  a  priori.  This  is  restrictive  in  most  dynamic  learning 
situations  -  it  may  not  be  appropriate  or  possible  to  select  the  actual  states  or 
actions  of  a  system  without  significant  experimentation  first.  In  general,  we  would 
like  to  simultaneously  learn  the  states,  actions  and  corresponding  optimal  policies 
at  one  time.  This  subject  has  been  looked  into  by  various  researchers  but  with  few 
analytic  results  yet.  There  has  been  growing  interest  in  dealing  with  systems  that 
have an infinite number (a continuum) of possible states and actions, requiring discretization
for  Q-Learning.  How  these  states  can  be  clustered  or  otherwise  organized  to  achieve 
optimal  operation  is  a  challenging  question  that  requires  serious  future  research. 
ARE THERE UNIVERSAL PRINCIPLES OF BRAIN

COMPUTATION? 

Stephen  Grossberg 

Boston  University,  Department  of  Cognitive  and  Neural  Systems  and 
Center  for  Adaptive  Systems,  611  Beacon  Street,  Boston,  Massachusetts  02215  USA 

1  Introduction 

Are  there  universal  computational  principles  that  the  brain  uses  to  self-organize 
its  intelligent  properties?  This  lecture  suggests  that  common  principles  are  used  in 
brain systems for early vision, visual object recognition, auditory source identification,
variable-rate speech perception, and adaptive sensory-motor control, among
others. These are principles of matching and resonance that form part of Adaptive
Resonance Theory, or ART. In particular, bottom-up signals in an ART system
can automatically activate target cells to levels capable of generating suprathreshold
output signals. Top-down expectation signals can only excite, or prime, target
cells to subthreshold levels. When both bottom-up and top-down signals are
simultaneously active, only the bottom-up signals that receive top-down support can
remain active. All other cells, even those receiving large bottom-up inputs, are
inhibited. Top-down matching hereby generates a focus of attention that can resonate
across  processing  levels,  including  those  that  generate  the  top-down  signals.  Such 
a  resonance  acts  as  a  trigger  that  activates  learning  processes  within  the  system. 
In  the  examples  described  herein,  these  effects  are  due  to  a  top-down  nonspecific 
inhibitory  gain  control  signal  that  is  released  in  parallel  with  specific  excitatory 
signals. 

2  Neural  Dynamics  of  Multi-Source  Audition 

How  does  the  brain’s  auditory  system  construct  coherent  representations  of  acoustic 
objects  from  the  jumble  of  noise  and  harmonics  that  relentlessly  bombards  our 
ears  throughout  life?  Bregman  [1]  has  distinguished  at  least  two  levels  of  auditory 
organization,  called  primitive  streaming  and  schema-based  segregation,  at  which 
such  representations  are  formed  in  order  to  accomplish  auditory  scene  analysis.  The 
present  work  models  data  about  both  levels  of  organization,  and  shows  that  ART 
mechanisms  of  matching  and  resonance  play  a  key  role  in  achieving  the  selectivity 
and  coherence  that  are  characteristic  of  our  auditory  experience. 

In  environments  with  multiple  sound  sources,  the  auditory  system  is  capable  of 
teasing  apart  the  impinging  jumbled  signal  into  different  mental  objects,  or  streams, 
as  in  its  ability  to  solve  the  cocktail  party  problem.  With  my  colleagues  Krishna 
Govindarajan,  Lonce  Wyse,  and  Michael  Cohen  [5],  a  neural  network  model  of 
this  primitive  streaming  process,  called  the  ARTSTREAM  model  (Figure  1),  has 
been  developed  that  groups  different  frequency  components  based  on  pitch  and 
spatial  location  cues,  and  selectively  allocates  the  components  to  different  streams. 
The  grouping  is  accomplished  through  a  resonance  that  develops  between  a  given 
object’s  pitch,  its  harmonic  spectral  components,  and  (to  a  lesser  extent)  its  spatial 
location.  Those  spectral  components  that  are  not  reinforced  by  being  matched  with 
the  top-down  prototype  read-out  by  the  selected  object’s  pitch  representation  are 
suppressed,  thereby  allowing  another  stream  to  capture  these  components,  as  in  the 
“old-plus-new heuristic” of Bregman [1]. These resonance and matching mechanisms
are  specialized  versions  of  ART  mechanisms. 


[Figure 1 diagram labels: Pitch summation layer; Input signal]

Figure 1  Block diagram of the ARTSTREAM auditory streaming model. Note
the nonspecific inhibitory feedback from pitch representations to spectral
representations.

The  model  is  used  to  simulate  data  from  psychophysical  grouping  experiments,  such 
as  how  a  tone  sweeping  upwards  in  frequency  creates  a  bounce  percept  by  grouping 
with  a  downward  sweeping  tone  due  to  proximity  in  frequency,  even  if  noise  replaces 
the  tones  at  their  intersection  point.  The  model  also  simulates  illusory  auditory 
percepts  such  as  the  auditory  continuity  illusion  of  a  tone  continuing  through  a 
noise  burst  even  if  the  tone  is  not  present  during  the  noise,  and  the  scale  illusion 
of  Deutsch  whereby  downward  and  upward  scales  presented  alternately  to  the  two 
ears  are  regrouped  based  on  frequency  proximity,  leading  to  a  bounce  percept.  The 
stream  resonances  provide  the  coherence  that  allows  one  voice  or  instrument  to  be 
tracked  through  a  multiple  source  environment. 

3  Neural  Dynamics  of  Variable-Rate  Speech  Categorization 

What is the neural representation of a speech code as it evolves in real time?
With my colleagues Ian Boardman and Michael Cohen [6], a neural model of this
schema-based segregation process, called the ARTPHONE model (Figure 2), has
been developed to quantitatively simulate data concerning segregation and
integration of phonetic percepts, as exemplified by the problem of distinguishing “topic”
from “top pick” in natural discourse. Psychoacoustic data concerning categorization
of stop consonant pairs indicate that the closure time between syllable final (VC)
and syllable initial (CV) transitions determines whether consonants are segregated,
i.e., perceived as distinct, or integrated, i.e., fused into a single percept. Hearing two
stops in a VC-CV pair that are phonetically the same, as in “top pick,” requires
about 150 msec more closure time than hearing two stops in a VC1-C2V pair that
are phonetically different, as in “odd ball.” As shown by Repp [10], when the
distribution of closure intervals over trials is experimentally varied, subjects’ decision
boundaries between one-stop and two-stop percepts always occurred near the mean
closure interval.

Figure 2  The ARTPHONE model. Working memory nodes (w) excite chunks (u)
through previously learned pathways. List chunks send excitatory feedback down
to their item source nodes. Bottom-up and top-down pathways are modulated by
habituative transmitter gates (filled squares). Item nodes receive input in an
on-center off-surround anatomy. Total input (I) is averaged to control an item rate
signal (r) that adjusts the working memory gain (g). Excitatory paths are marked
with arrowheads, inhibitory paths with small open circles.

The  ARTPHONE  model  traces  these  properties  to  dynamical  interactions  between 
a  working  memory  for  short-term  storage  of  phonetic  items  and  a  list  categoriza¬ 
tion  network  that  groups,  or  chunks,  sequences  of  the  phonetic  items  in  working 
memory.  These  interactions  automatically  adjust  their  processing  rate  to  the  speech 
rate  via  automatic  gain  control.  The  speech  code  in  the  model  is  a  resonant  wave 
that  emerges  after  bottom-up  signals  from  the  working  memory  select  list  chunks 
which,  in  turn,  read  out  top-down  expectations  that  amplify  consistent  working 
memory items. This resonance may be rapidly reset by inputs, such as C2, that are
inconsistent with a top-down expectation, say of C1; or by a collapse of resonant
activation  due  to  a  habituative  process  that  can  take  a  much  longer  time  to  occur, 
as  illustrated  by  the  categorical  boundary  between  VCV  and  VC-CV.  The  catego¬ 
rization  data  may  thus  be  understood  as  emergent  properties  of  a  resonant  process 
that  adjusts  its  dynamics  to  track  the  speech  rate. 

4  Neural  Dynamics  of  Boundary  and  Surface  Representation 

Figure 3 FACADE model macrocircuit. Boundary representation, or BCS, stages
are designated by octagonal boxes, surface representation, or FCS, stages by
rectangular boxes.

With my colleagues Alan Gove and Ennio Mingolla [4], a neural network model,
called a FACADE theory model (Figure 3), has been developed to explain how
visual thalamocortical interactions give rise to boundary percepts such as illusory
contours  and  surface  percepts  such  as  filled-in  brightnesses.  Top-down  feedback 
interactions  are  needed  in  addition  to  bottom-up  feedforward  interactions  to  sim¬ 
ulate  these  data.  One  feedback  loop  is  modeled  between  lateral  geniculate  nucleus 
(LGN) and cortical area V1, and another within cortical areas V1 and V2. The
first  feedback  loop  realizes  a  resonant  matching  process,  as  in  ART,  which  en¬ 
hances  LGN  cell  activities  that  are  consistent  with  those  of  active  cortical  cells, 
and  suppresses  LGN  activities  that  are  not.  This  corticogeniculate  feedback,  being 
endstopped  and  oriented,  also  enhances  LGN  ON  cell  activations  at  the  ends  of  thin 
dark  lines,  thereby  leading  to  enhanced  cortical  brightness  percepts  when  the  lines 
group  into  closed  illusory  contours.  The  second  feedback  loop  generates  boundary 
representations,  including  illusory  contours,  that  coherently  bind  distributed  corti¬ 
cal  features  together.  Brightness  percepts  form  within  the  surface  representations 
through  a  diffusive  filling-in  process  that  is  contained  by  resistive  gating  signals  from 
the  boundary  representations.  The  model  is  used  to  simulate  illusory  contours  and 
surface  brightnesses  induced  by  Ehrenstein  disks,  Kanizsa  squares,  Glass  patterns, 
and cafe wall patterns in single contrast, reverse contrast, and mixed contrast
configurations. These examples illustrate how boundary and surface mechanisms can
generate  percepts  that  are  highly  context-sensitive,  including  how  illusory  contours 
can be amodally recognized without being seen, how model simple cells in V1 respond
preferentially to luminance discontinuities using inputs from both LGN ON
and  OFF  cells,  how  model  bipole  cells  in  V2  with  two  colinear  receptive  fields  can 
help  to  complete  curved  illusory  contours,  how  short-range  simple  cell  groupings 
and  long-range  bipole  cell  groupings  can  sometimes  generate  different  outcomes,  and 
how  model  double-opponent,  filling-in  and  boundary  segmentation  mechanisms  in 
V4  interact  to  generate  surface  brightness  percepts  in  which  filling-in  of  enhanced 
brightness  and  darkness  can  occur  before  the  net  brightness  distribution  is  com¬ 
puted  by  double-opponent  interactions.  Taken  together,  these  results  emphasize  the 
importance  of  resonant  feedback  processes  in  generating  conscious  percepts  in  the 
visual  brain. 
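
The idea of filling-in contained by boundary gating can be illustrated with a minimal sketch. This is not the FACADE model's own set of equations: it assumes a generic one-dimensional diffusion with passive decay, in which the permeability of each link between neighbouring cells falls as the boundary signal across that link grows. The function name `fill_in`, the parameters `decay` and `gate`, and the 20-cell layout are all hypothetical choices made for illustration.

```python
def fill_in(features, boundaries, decay=0.05, gate=1000.0, T=400.0, dt=0.2):
    """Boundary-gated diffusion on a 1-D row of cells (explicit Euler steps).

    features[i]   : feature input to cell i
    boundaries[i] : boundary strength on the link between cell i and cell i + 1
    """
    n = len(features)
    # Assumed gating law: permeability of a link shrinks as its boundary signal grows.
    perm = [1.0 / (1.0 + gate * b) for b in boundaries]
    s = [0.0] * n
    for _ in range(int(T / dt)):
        new = []
        for i in range(n):
            flow = 0.0
            if i > 0:
                flow += perm[i - 1] * (s[i - 1] - s[i])
            if i < n - 1:
                flow += perm[i] * (s[i + 1] - s[i])
            new.append(s[i] + dt * (-decay * s[i] + flow + features[i]))
        s = new
    return s

n = 20
features = [0.0] * n
features[2] = 1.0                 # a single feature input inside the left region
boundaries = [0.0] * (n - 1)
boundaries[9] = 1.0               # one boundary between cells 9 and 10

surface = fill_in(features, boundaries)
left, right = surface[:10], surface[10:]
print(min(left), max(right))
```

Under these assumptions the activity spreads through the left compartment but barely crosses the gated link, which is the qualitative behaviour the text attributes to resistive gating signals from the boundary representations.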
Figure 4 SACCART model for multimodal control of saccadic eye movements by
the  superior  colliculus. 


5  Neural  Dynamics  for  Multimodal  Control  of  Saccadic  Eye 
Movements 

Saccades  are  eye  movements  by  which  an  animal  can  scan  a  rapidly  changing  envi¬ 
ronment.  While  the  saccadic  system  plans  where  to  move  the  eyes,  it  also  retains 
reflexive  responsiveness  to  fluctuating  light  sources.  These  two  types  of  saccades 
ultimately  result  in  control  of  the  same  set  of  eye  muscles.  Visually  reactive  cells 
encode  gaze  error  in  a  retinotopically  activated  motor  map.  Planned  targets  are 
coded  in  head-centered  coordinates.  When  two  conflicting  commands  attempt  to 
share  control  of  the  saccadic  eye  movement  system,  the  system  must  resolve  the 
conflict  and  coordinate  command  of  one  set  of  eye  muscles. 

The  superior  colliculus  is  a  brainstem  region  that  plays  a  prominent  role  in  both 
planned  and  reactive  saccades.  This  region  coordinates  information  to  adjust  move¬ 
ments  of  the  head  and  eyes  to  a  stimulus.  In  order  to  combine  these  visual,  somatic, 
and  auditory  saccade  targets  in  the  superior  colliculus,  the  targets  in  head-centered 
coordinates are mapped to a gaze motor error in retinotopic coordinates.
How does the saccadic movement system select a target when visual and planned
movement  commands  differ?  How  do  retinal,  head-centered,  and  motor  error  co¬ 
ordinates  interact  during  the  selection  process?  How  are  these  coordinate  systems 
rendered  consistent  through  learning?  Recent  data  on  superior  colliculus  (SC)  re¬ 
veal  a  travelling  wave  of  activation  whose  peak  codes  the  current  gaze  error  [9]. 
In  contrast,  Waitzman  [12]  found  that  the  locus  of  peak  activity  in  SC  remains 
constant  while  the  activity  level  at  this  locus  decays  as  a  function  of  residual  gaze 
error.  Why  do  these  distinct  cell  types  exist? 

With  my  colleagues  Mario  Aguilar,  Dan  Bullock,  and  Karen  Roberts  [8],  a  neural 
network  model  has  been  developed  that  answers  these  questions  while  providing  a 
functional  rationale  for  both  signal  patterns  (Figure  4).  The  model  assumes  that 
calibration  between  visual  inputs  and  eye  movement  commands  is  learned  early  in 
development within a visually reactive saccade system [7]. Visual error signals coded
in  retinotopic  coordinates  calibrate  adaptive  gains  to  achieve  accurate  foveation. 
The  accuracy  of  planned  saccades  derives  from  using  the  gains  learned  by  the  re¬ 
active  system.  For  this,  a  transformation  between  a  planned  head-centered  and  a 
retinotopic  target  representation  needs  to  be  learned.  ART  matching  and  resonance 
control  the  stability  of  this  learning  and  the  attentive  selection  of  saccadic  target 
locations.  Targets  in  retinotopic  and  head-centered  coordinates  are  rendered  dimen¬ 
sionally  consistent  so  that  they  can  compete  for  attention  to  generate  a  movement 
command  in  motor  error  coordinates.  Simulations  show  how  a  decaying,  stationary 
activity  profile  is  obtained  in  the  pre-motor  layer  due  to  feedback  modulation.  In 
addition,  a  travelling  wave  activity  profile  is  produced  in  the  motor  layer  due  to 
modulation  from  the  pre-motor  layer  and  the  nature  of  the  local  connectivity.  The 
simulations  show  how  this  model  reproduces  physiological  data  of  these  two  classes 
of  collicular  neurons  simultaneously  during  a  variety  of  behavioral  tasks  (e.g.,  vi¬ 
sual,  memory,  gap,  and  overlap  conditions),  an  achievement  previously  unattained 
by  eye  movement  models.  In  addition,  the  model  also  clarifies  how  the  SC  integrates 
signals  from  multiple  modalities,  and  simulates  collicular  response  enhancement  and 
depression  produced  by  multimodal  stimuli  [11]. 

6  Concluding  Remarks 

In  addition  to  these  systems  are  the  more  familiar  ART  systems  for  explaining  vi¬ 
sual  object  recognition  and  its  breakdown  due  to  hippocampal  lesions  that  lead  to 
medial  temporal  amnesia  [2].  Taken  together,  these  results  provide  accumulating 
evidence  that  the  brain  parsimoniously  specifies  a  small  set  of  computational  prin¬ 
ciples  to  ensure  its  stability  and  adaptability  in  responding  to  many  different  types 
of environmental challenges.

ON-LINE TRAINING OF MEMORY-DRIVEN ATTRACTOR NETWORKS

Morris  W.  Hirsch 

Department  of  Mathematics,  University  of  California  at  Berkeley,  USA. 

A  rigorous  mathematical  analysis  is  presented  of  a  class  of  continuous  networks  having  rather 
arbitrary  activation  dynamics,  with  input  patterns  classified  by  attractors  by  a  special  training 
scheme.  Memory  adapts  continually,  whether  or  not  a  training  signal  is  present.  It  is  shown 
that  consistent  input-output  pairs  can  be  learned  perfectly  provided  every  pattern  is  repeated 
sufficiently  often,  and  input  patterns  are  nearly  orthogonal. 

1  Introduction 

Most  neural  networks  used  for  pattern  classification  have  the  following  features: 

■  The  training  dynamics  is  separate  from  the  activation  dynamics. 

■  Patterns  are  classified  by  fixed  points  or  limit  cycles  of  the  activation  dynamics. 

But  biological  neural  networks —  nervous  systems —  do  not  conform  to  these  rules. 
We  learn  while  doing —  we  learn  by  doing,  and  if  we  don’t  do,  we  may  forget.  And 
chaotic  dynamics  is  the  rule  in  many  parts  of  the  cerebral  cortex  (see  the  papers 
by Freeman et al.).

Here  we  look  at  supervised  learning  in  continuous  time  nets  in  which: 

■  The  memory  matrix  is  always  adapting:  the  only  difference  between  training 
and  testing  is  the  presence  of  a  training  signal.  Testing  reinforces  learning,  lack 
of  testing  can  lead  to  forgetting,  and  retraining  at  any  time  is  possible. 

■  Patterns  are  classified  by  possibly  chaotic  attractors. 

■  Training  is  completed  on  any  single  pass  through  the  training  set. 

The  fundamental  problem  faced  by  such  a  system  is  that  presentation  of  new  in¬ 
put  patterns  affects  the  memory  of  old  patterns,  whether  or  not  a  training  signal 
is  present;  consequently  there  is  a  tendency  for  previously  trained  memories  to 
degrade.  To  prevent  this  from  occurring,  we  constrain  the  system  in  two  ways: 

■  Each  input  pattern  is  repeated  sufficiently  frequently,  with  or  without  a  train¬ 
ing  signal. 

■  Input  patterns  are  nearly  orthogonal. 

When  the  input  patterns  are  sufficiently  orthogonal,  depending  on  the  number  of 
patterns,  then  system  parameters  can  be  chosen  robustly  so  that  the  system  learns 
to  give  the  correct  output  for  each  input  pattern  on  which  it  has  consistently  trained. 
The  net  comprises  an  input  layer  of  d  units  which  feeds  into  a  recurrently  connected 
activation  layer  of  n  units.  Inputs  and  activations  can  take  any  real  number  values. 
Inputs  are  drawn  from  a  fixed  list  of  patterns,  taken  for  convenience  to  be  unit 
vectors. With any input a training signal may be presented for a minimum time and
then shut off. These training signals need not be consistent, but those patterns
that are trained consistently and tested sufficiently frequently will respond with
the correct output.

Rather  than  taking  outputs  to  be  the  usual  stable  equilibria  or  limit  cycles,  we 
consider  outputs  to  be  the  attractors  (more  precisely,  their  basins)  to  which  the 
activation  dynamics  tends.  (Compare  [2],  [4]).  Each  training  signal  directs  the  ac¬ 
tivation  dynamics  into  the  basin  of  an  attractor. 

In  some  biological  models  it  is  the  attractor  as  a  geometric  object,  rather  than 
the  dynamics  in  the  attractor,  that  classifies  the  input.  If  the  components  of  the 
activation vector x represent firing rates of cells, then the attractor may correspond
to  a  particular  cell  assembly,  and  the  useful  information  is  which  cell  assembly  is 
active,  with  dynamical  details  being  irrelevant.  In  this  connection  compare  [8];  [1]. 
Before  giving  details  of  the  dynamics,  we  first  describe  how  the  network  looks  from 
the  outside.  From  this  black  box  point  of  view  the  activation  dynamics  appears  to 
be governed by a differential equation in Euclidean n-space $\mathbb{R}^n$,

$$\frac{dx}{dt} = F(x) + I \qquad (1)$$

$$\phantom{\frac{dx}{dt}} = -x + f(x) + I \qquad (2)$$

where $f : \mathbb{R}^n \to \mathbb{R}^n$ is bounded and $I \in \mathbb{R}^n$ is a constant training signal which may
be 0. Input patterns are chosen from a finite set of distinct vectors $\xi^a \in \mathbb{R}^d$, $a =
1, \dots, m$. Training signals are selected from a compact set $S \subset \mathbb{R}^n$ which contains
the origin.

At stage k of running the net, an input and a possibly null training signal are
presented simultaneously to the net for a certain time interval $[t_{k-1}, t'_k]$ called the
k'th instruction period, followed by the computation period $[t'_k, t_k]$ during which the
training signal is removed (set to 0) while the same input is present. The activation
vector $x$ then settles down to an attractor for the dynamics of the vector field $F$,
that is, for the differential equation

$$\frac{dx}{dt} = -x + f(x). \qquad (3)$$

When  this  process  is  repeated  with  a  different  input  and  training  signal,  the  acti¬ 
vation  variable  x  jumps  instantaneously  to  a  new  initial  position. 

To each pattern $\xi^a$ there is associated a special attractor $A^a \subset \mathbb{R}^n$ for (3), and a
target $U^a$, which is a positively invariant neighborhood of $A^a$ in $B(A^a)$. It is desired
that when $\xi^a$ is input, $x(t)$ goes into $U^a$ and stays there (thus approaching $A^a$),
until the input is changed.

Training consists in using training signals from a special subset $S^a \subset S$ of proper
training signals associated with $\xi^a$. Other nonzero training signals are improper.

It  turns  out  that  with  sufficiently  orthogonal  input  patterns  and  suitable  parame¬ 
ters,  any  pattern  that  is  trained  on  a  proper  training  signal  at  stage  k  will  go  to  its 
correct  target  at  a  later  stage  provided  that  at  intervening  stages  the  pattern  was 
presented  sufficiently  often  and  no  improper  training  signals  were  used. 

Now we explain how the memory evolves. The time-varying memory matrix $M \in
\mathbb{R}^{n \times d}$ maps static input patterns (column vectors) $\xi$ to dynamic activation vectors
$x$ by $x(t) = M(t)\xi$. An auxiliary fast variable $y \in \mathbb{R}^n$ provides a delay between
$x(t)$ and $M(t)$. The true dynamics is given by the following Main System, where
$\xi^T$ denotes the transpose of $\xi$:

$$\frac{dM}{dt} = (-x + y + I)\,\xi^T, \qquad (4)$$

$$\lambda \frac{dy}{dt} = f(x) - y, \qquad (5)$$

$$x = M\xi. \qquad (6)$$

The  system’s  dynamics  is  thus  driven  by  M{t);  there  is  no  independent  activation 
dynamics. 

Computing $dx/dt$ and using the fact that $\xi^T \xi = \|\xi\|^2 = 1$, we obtain the following
system:

$$\frac{dx}{dt} = -x + y + I, \qquad (7)$$

$$\lambda \frac{dy}{dt} = f(x) - y. \qquad (8)$$

This system is independent of the input $\xi$, but the interpretation of $x$ depends on
$\xi$, and $x$ (but not $y$) jumps discontinuously when $\xi$ is changed.

It is interesting to observe that System (7), (8) is equivalent to a single second order
equation in $x$, namely:

$$\lambda \frac{d^2 x}{dt^2} + (1 + \lambda)\frac{dx}{dt} + x = f(x) + I, \qquad (9)$$

similar to the second order activation dynamics used by W. Freeman et al. (see
References). The following fact is used:

Proposition 1 If $x(t)$ obeys Equation (9) and $\|f(x)\| < \rho$, then $x(t) \to N_\rho(I)$.

Standard dynamical system techniques (singular perturbations, invariant manifolds)
show that the dynamics of System (7), (8) is closely approximated by
that of the simpler system

$$\frac{dx}{dt} = -x + f(x) + I \qquad (10)$$

provided $\lambda$ is small enough.
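
The quality of this approximation can be probed numerically. The sketch below is only an illustration under stated assumptions: it takes the scalar case, reads the two-variable system as dx/dt = -x + y + I, lambda dy/dt = f(x) - y, chooses f(x) = tanh(5x) as the bounded sigmoid, starts y on the curve y = f(x), and integrates with explicit Euler steps. The helper names (`simulate_full`, `simulate_reduced`, `max_gap`) are hypothetical.

```python
import math

def f(x, gain=5.0):
    # Assumed sigmoid; the text only requires f to be bounded.
    return math.tanh(gain * x)

def simulate_full(x0, I, lam, T=10.0, dt=0.001):
    """Euler steps for dx/dt = -x + y + I, lam * dy/dt = f(x) - y."""
    x, y = x0, f(x0)              # start y on the slow curve y = f(x)
    xs = [x]
    for _ in range(int(T / dt)):
        # Both right-hand sides are evaluated at the old (x, y).
        x, y = x + dt * (-x + y + I), y + dt * (f(x) - y) / lam
        xs.append(x)
    return xs

def simulate_reduced(x0, I, T=10.0, dt=0.001):
    """Euler steps for the reduced equation dx/dt = -x + f(x) + I."""
    x = x0
    xs = [x]
    for _ in range(int(T / dt)):
        x = x + dt * (-x + f(x) + I)
        xs.append(x)
    return xs

def max_gap(lam, x0=0.5, I=0.0):
    """Largest distance between the full and reduced x-trajectories."""
    full = simulate_full(x0, I, lam)
    reduced = simulate_reduced(x0, I)
    return max(abs(a - b) for a, b in zip(full, reduced))

gap_small = max_gap(lam=0.01)
gap_large = max_gap(lam=0.5)
print(gap_small, gap_large)
```

With these choices the gap between the two trajectories shrinks as the time constant is reduced, which is the behaviour the singular perturbation argument predicts. The sketch says nothing about how small the time constant must be in general.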

Example 1 A simple but interesting system of this type has 1-dimensional activation
dynamics. For $f$ we take any smooth (Lipschitz also works) sigmoid with
high gain, having limiting values $\pm 1$. The dynamics of (10) with $I = 0$,

$$\frac{dx}{dt} = -x + f(x),$$

has stable equilibria at (approximately) $x = \pm 1$ and an unstable equilibrium at
$x = 0$. The 2-dimensional dynamics of System (7), (8) with $I = 0$ has two attracting
equilibria near $(1, 1)$ and $(-1, -1)$, and an unstable equilibrium near $(0, 0)$.
Every trajectory tends to one of these three equilibria. As training signals we take
$I_- = -1$, $I_+ = +1$. When $I$ takes either of these values, $(x, y)$ settles to the
global attractor near $(2I_\pm, I_\pm)$. When $I$ is then set to 0, keeping the same input
pattern, $(x, y)$ relaxes toward $(I_\pm, I_\pm)$. This system learns to sort any number
of sufficiently orthogonal input patterns into two arbitrary classes after a single
pass through the pattern set. A similar system is treated in detail in [9].
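
The behaviour described in Example 1 can be sketched in a few lines. The code below is an illustration under several assumptions, not the author's implementation: it reads the Main System for scalar activation as dM/dt = (-x + y + I) xi^T, lambda dy/dt = f(x) - y, x = M xi; it takes f(x) = tanh(10x), lambda = 0.05, exactly orthogonal unit patterns, and explicit Euler integration. The names `run_period` and `run_stage` are hypothetical.

```python
import math

def f(x, gain=10.0):
    # Assumed high-gain sigmoid with limiting values +1 and -1.
    return math.tanh(gain * x)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def run_period(M, y, xi, I, lam=0.05, T=20.0, dt=0.001):
    """Euler-integrate the scalar Main System with input xi and signal I held fixed."""
    M = list(M)
    for _ in range(int(T / dt)):
        x = dot(M, xi)                    # x = M xi
        drive = -x + y + I                # dx/dt for a unit pattern xi
        y = y + dt * (f(x) - y) / lam     # lam * dy/dt = f(x) - y
        M = [m + dt * drive * c for m, c in zip(M, xi)]   # dM/dt = drive * xi^T
    return M, y

def run_stage(M, y, xi, I):
    """One stage: an instruction period with signal I, then a computation period with I = 0."""
    M, y = run_period(M, y, xi, I)
    return run_period(M, y, xi, 0.0)

xi_a, xi_b = [1.0, 0.0], [0.0, 1.0]       # exactly orthogonal unit patterns
M, y = [0.0, 0.0], 0.0

M, y = run_stage(M, y, xi_a, +1.0)        # train pattern a toward the class +1
M, y = run_stage(M, y, xi_b, -1.0)        # train pattern b toward the class -1
M, y = run_stage(M, y, xi_a, 0.0)         # test pattern a with a null training signal

recall_a, recall_b = dot(M, xi_a), dot(M, xi_b)
print(round(recall_a, 3), round(recall_b, 3))
```

Note that y is carried over from one stage to the next while x jumps with the input. In this sketch the small time constant lets y recover quickly enough that the recall of pattern a stays in the basin of its trained class when it is tested after pattern b.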

Higher  dimensional  examples  can  be  obtained  from  this  one  by  taking  Cartesian 
products  of  n  such  vector  fields,  yielding  2”  attracting  equilibria.  More  interesting 
dynamics  can  be  constructed  by  changing  the  vector  field  in  small  neighborhoods 
of  the  attracting  equilibria.  In  this  way  attractors  with  arbitrary  dynamics  can  be 
constructed.

2 Statement of Results

Notation. $\|u\| = \sqrt{u^T u}$ denotes the Euclidean norm of vector $u$. The closed ball
of radius $\rho$ about a point $z$ is denoted by $N_\rho(z)$. The closed $\rho$-neighborhood of a
set $A$ is $N_\rho(A)$.

We start with the data: $f$, $\{A^a\}$, $\{U^a\}$, $\{S^a\}$.

We assume given an increasing sequence of times $t_{k-1} < t'_k < t_k$, $k \in \mathbb{N}$. The Main
System is run during stage $k$ as described above, taking as initial values at $t_{k-1}$
whatever the terminal values were in the previous stage (if $k > 1$).

We make the following assumptions, in terms of parameters $\epsilon > 0$, $T > 0$, $N > 1$,
$R > 0$, $\rho > 0$:

Hypothesis 2

Unit pattern vectors: $\|\xi^a\| = 1$.

Separation of patterns: $|\xi^a \cdot \xi^b| < \epsilon$ for distinct $a$, $b$.

Duration of presentation: $t'_k - t_{k-1} > T$ and $t_k - t'_k > T$ for all $k$.

Frequency of presentation: Each pattern $\xi^a$ is used as input at least once in
every $N$ successive stages.

Initial values: $\|y(0)\| < 1$, $\|M(0)\xi^a\| < R$.

Bound on $f$: $\|f(x)\| < \rho$.

Proper training signals: $S^a = \{I \in S : N_\rho(I) \subset \dots\}$.

The  following  theorems  refer  to  the  main  system.  They  mean  that  robust  parameters 
can be chosen so that whenever the pattern from a consistent pair is input
after  being  properly  trained  once,  the  output  is  always  correct,  whether  or  not  a 
training  signal  is  input,  and  regardless  of  the  training  signals  when  other  patterns 
are  input;  and  moreover,  all  the  variables  stay  within  given  bounds. 

Define time-varying vectors $x^a(t) = M(t)\xi^a \in \mathbb{R}^n$, the net's current recall of input
$\xi^a$.

A pattern $\xi^a$ is consistently trained from stages $k$ to $l$ provided that at stage $k$
the input is $\xi^a$ and the training signal is proper, while at each stage $r$, $k < r < l$,
at which $\xi^a$ is input, the training signal is either proper or 0 (both may occur at
different stages).

$\xi^a$ tests successfully during stages $k$ to $l$ provided $x^a(t_j) \in U^a$ whenever $k < j < l$
and the input for stage $j$ is $\xi^a$.

Recall that $\lambda$ is the time constant in Equation (5).

Theorem 3 There exist positive constants $T_*$, $R_*$, $\lambda_*$, $r_*$, independent of $N$, $\epsilon$, and
computable from the data, with the following property. Assume $0 < \lambda < \lambda_*$ and
Hypothesis 2 holds with $T > T_*$, $R > R_*$, and in addition $\epsilon N < r_*$. Then every
pattern which is consistently trained from stages $k$ to $l$ tests successfully during
those stages.


The quantity $\epsilon N(1 + \dots)$ plays a key role. Notice that it tends to 0 as $\epsilon N \to 0$.

Theorem 4 There exist positive constants $T_*$, $R_*$, $\lambda_*$, independent of $\epsilon$, $N$, and
computable from the data, with the following property. If $0 < \lambda < \lambda_*$ and Hypothesis
2 holds with $T > T_*$, $R > R_*$, then:

(11)

for all $t > 0$, $a = 1, \dots, m$.

It is clear from (4) that if a vector $\eta \in \mathbb{R}^d$ is orthogonal to all the input patterns,
then $M(t)\eta$ is constant. Therefore Theorem 4 implies that the entries in $M(t)$
are uniformly bounded.

The key to the proofs is the following lemma. Suppose the input is $\xi = \xi^a$ in System
(4), (5), (6). Computing $dx^b/dt$ shows $dx^b/dt = (\xi^a \cdot \xi^b)\,dx^a/dt$, yielding the crucial
estimate:

Lemma 5 If the input is $\xi^a$ and $b \neq a$, then for $t_1 > t_0 > 0$ we have:

$$\|x^b(t_1) - x^b(t_0)\| \le \epsilon\,\|x^a(t_1) - x^a(t_0)\|. \qquad (12)$$

This means that after presentation of $\xi^b$, subsequent presentation of $\xi^a$ does not
alter the net's recall of $\xi^b$ by very much provided $\epsilon$ is sufficiently small.
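
The interference estimate can be checked numerically in a small case. The sketch below assumes scalar activation, the Main System read as dM/dt = (-x + y + I) xi^T, lambda dy/dt = f(x) - y, a tanh sigmoid, and explicit Euler steps; the overlap value 0.1, the initial memory, and the helper name `present` are hypothetical choices for illustration.

```python
import math

def f(x, gain=10.0):
    # Assumed bounded sigmoid.
    return math.tanh(gain * x)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def present(M, y, xi, I, lam=0.05, T=5.0, dt=0.001):
    """Present input xi with training signal I for time T (explicit Euler steps)."""
    M = list(M)
    for _ in range(int(T / dt)):
        x = dot(M, xi)
        drive = -x + y + I
        y = y + dt * (f(x) - y) / lam
        M = [m + dt * drive * c for m, c in zip(M, xi)]
    return M, y

overlap = 0.1                                   # inner product of the two patterns
xi_a = [1.0, 0.0]
xi_b = [overlap, math.sqrt(1.0 - overlap ** 2)]  # unit vector with xi_a . xi_b = overlap

M0, y0 = [0.3, -0.7], 0.0
M1, _ = present(M0, y0, xi_a, I=1.0)

change_a = dot(M1, xi_a) - dot(M0, xi_a)        # change in recall of the presented pattern
change_b = dot(M1, xi_b) - dot(M0, xi_b)        # induced change in recall of the other pattern
print(change_a, change_b)
```

In this setting the induced change in the recall of the second pattern equals the overlap times the change in the recall of the presented pattern, up to rounding, so it is bounded by any separation constant that exceeds the overlap.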

3  Discussion 

Theorem  3  means  that  each  consistently  trained  input  pattern  yields  the  correct 
output  attractor  regardless  of  the  training  of  the  other  patterns,  which  may  be 
untrained  or  even  inconsistently  trained.  Moreover  the  system  can  be  trained  or 
retrained  on  any  patterns  without  impairing  the  previous  training  of  the  other 
patterns. 

Example 2 An example which illustrates the difficulty in verifying the assumption
on proper training signals in Hypothesis 2 is obtained by replacing $f$ in Example
1 by $f_1(x) = f(x + 1) + f(x - 1)$. (If $f$ approximates a step function, then $f_1$ approximates
a staircase function.) For high gain the vector field $-x + f_1(x)$ has attracting
equilibria very near $-2$, 0 and 2, and unstable equilibria near $-1$ and 1. The basin
of attraction for the equilibrium near 0 is contained in the interval $(-1, 1)$. As
we must take $\rho \approx 2$, Hypothesis 2 is violated because $N_2(0) \not\subset (-1, 1)$.
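
The equilibrium structure claimed in this example can be checked with a short scan. The sketch assumes the staircase is built as f(x + 1) + f(x - 1) with f(x) = tanh(20x); the grid, the gain, and the helper names `equilibria` and `is_attracting` are hypothetical.

```python
import math

def f(x, gain=20.0):
    # Assumed high-gain sigmoid approximating a step from -1 to +1.
    return math.tanh(gain * x)

def F1(x):
    """Vector field -x + f1(x), with the staircase f1(x) = f(x + 1) + f(x - 1)."""
    return -x + f(x + 1.0) + f(x - 1.0)

def equilibria(lo=-3.0, hi=3.0, steps=6000):
    """Locate zeros of F1 on a uniform grid by exact hits or sign changes."""
    roots = []
    h = (hi - lo) / steps
    for i in range(steps):
        a, b = lo + i * h, lo + (i + 1) * h
        Fa, Fb = F1(a), F1(b)
        if Fa == 0.0:
            roots.append(a)
        elif Fa * Fb < 0.0:
            roots.append(0.5 * (a + b))
    return roots

def is_attracting(x, h=0.05):
    # In one dimension an equilibrium attracts when the field points toward it on both sides.
    return F1(x - h) > 0.0 > F1(x + h)

found = [(round(x, 1), is_attracting(x)) for x in equilibria()]
print(found)
```

Each entry of `found` pairs an equilibrium location (rounded to one decimal) with a flag saying whether it attracts. Since the staircase takes values of magnitude close to 2, the bound on it must be about 2, and the closed ball of that radius about 0 cannot fit inside the interval that contains the basin of the middle equilibrium.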

The requirement that input patterns be unit vectors is made only for simplicity. If
they are not unit vectors then the assumption on separation of distinct patterns in
Hypothesis 2 is changed to

$$\frac{|\xi^a \cdot \xi^b|}{\|\xi^a\|^2} < \epsilon,$$

with similar results. The dimension $d$ of the input space is irrelevant to our results.
In  fact  the  input  patterns  could  be  members  of  an  arbitrary  inner  product  space. 
It is not hard to see that some restriction on the geometry of the patterns is
necessary.  For  example,  a  set  of  linearly  dependent  patterns  severely  constrains  the 
associated  outputs  that  can  be  learned. 

The  training-testing  schedule  is  also  crucial.  Simple  examples  show  that,  as  with 
natural  learning  systems,  if  a  pattern  is  not  repeated  sufficiently  often  it  may  be 
forgotten. 

Interesting stochastic questions arise. Suppose, for example, that instead of considering
consistently trained patterns, we allow the wrong teaching signal to be given
with small positive probability. Is it true, as seems likely, that under some similar
training  scheme  there  is  a  high  probability  of  correct  output? 

Some  other  issues:  Is  incremental  learning  possible?  Can  such  nets  be  usefully  cou¬ 
pled?  Can  more  than  one  layer  of  memory  be  trained  this  way?  Is  there  a  good  way 
to approximately orthogonalize inputs by preprocessing?

A beginning is being made on the hard problem of constructing an artificial brain, to try to bridge
the gap between neurophysiology and psychology. The approach uses especially the increasing
number of results obtained by means of non-invasive instruments (EEG, MEG, PET, fMRI), and
the related psychological tasks. The paper describes the program and some of the mathematical
problems that it is producing. In particular the class of problems associated with the activities of
various coupled modules used in higher order control and attention will be discussed. These include
posterior attentional coupled systems, and the anterior motor/action/attention “ACTION” networks,
based on biological circuits. The relevance of these modules, and the associated problems,
for higher order cognition, is emphasised.

1  Introduction 

The Harvard neurologist Gerald Fischbach stated recently “The human brain is the
most complicated object on earth”. That does indeed seem to be the case, and for
a  considerable  period  of  time  since  the  beginning  of  the  scientific  revolution  this 
fact  seems  to  have  led  to  despair  in  ever  being  able  to  have  a  reasonable  compre¬ 
hension  of  it.  However  there  is  now  a  strong  sense  of  optimism  in  the  field  of  brain 
sciences  that  serious  progress  is  being  made  in  the  search  for  comprehension  of 
the  mechanisms  used  by  the  brain  to  achieve  its  amazing  powers,  even  up  to  the 
level  of  consciousness.  Part  of  this  optimism  stems  from  the  discovery  and  devel¬ 
opment  of  new  windows  with  which  to  look  at  the  brain,  with  large  groups  now 
being  established  to  exploit  the  power  of  the  available  instruments.  There  are  also 
new  theories  of  how  neurons  might  combine  their  activities  to  produce  mind-like 
behaviour,  based  on  recent  developments  in  the  field  of  artificial  neural  networks. 
At  the  same  time  there  is  an  ever  increasing  understanding  of  the  subtleties  of  the 
mechanisms  used  by  living  neurons.  It  would  thus  appear  that  the  mind  and  brain 
are  being  attacked  at  many  levels.  The  paper  attempts  to  contribute  towards  this 
program  by  starting  with  a  brief  survey  of  the  results  of  non-invasive  instruments. 
Then  it  turns  to  consider  the  inverse  problem  before  discussing  the  nature  of  past 
global  brain  models.  An  overview  is  then  given  of  the  manner  in  which  neural 
modelling may attempt to begin to achieve perception. Problems of neuropsychological
modelling are then considered, with the breakdown of tasks into component
sub-processes.  Mathematical  problems  associated  with  constructing  neural  modules 
to  achieve  the  required  functionality  of  sub-process  are  then  discussed.  Modules  to 
produce  possible  conscious  processing  are  then  considered,  and  the  paper  concludes 
with  a  summary  and  discussion  of  further  questions. 

2  Non-Invasive  Results 

There  are  amazing  new  windows  on  the  mind.  These  measure  either  the  magnetic 
or  electric  fields  caused  by  neural  activity,  or  the  change  in  blood  flow  in  active 
regions  of  the  brain  to  allow  oxygen  to  be  brought  to  support  the  neural  activity. 
The former techniques are termed magnetoencephalography (MEG) and electroencephalography
(EEG), the latter positron emission tomography (PET) and functional
magnetic resonance imaging (fMRI) respectively. The techniques of MEG,
EEG  and  fMRI  are  truly  non-invasive  since  there  is  no  substance  penetrating  the 
body,  whilst  that  is  not  so  with  PET,  in  which  a  subject  imbibes  a  radioactive 
material  which  will  decay  rapidly  with  emission  of  positrons  (with  their  ensuing 
decay  to  two  photons). 

The  methods  have  different  advantages  and  disadvantages.  Thus  EEG  is  a  very 
accurate  measure  of  the  time  course  of  the  average  electrical  current  from  an  ag¬ 
gregate  of  neurons  firing  in  unison,  but  is  contaminated  with  conduction  currents 
flowing  through  the  conducting  material  of  the  head.  Thus  although  temporal  ac¬ 
curacy  of  a  few  milliseconds  can  be  obtained  spatial  accuracy  is  mainly  limited  to 
surface  effects,  and  deep  currents  are  difficult  to  detect  with  certainty.  MEG  does 
not  have  the  defect  of  being  broadened  out  spatially  by  the  conduction  currents, 
but  has  the  problem  of  spatial  accuracy,  which  is  slowly  becoming  resolved  (but 
also  has  a  more  difficult  one  of  expense  due  to  the  need  for  special  screening  and 
low  temperature  apparatus  for  the  SQUID  devices).  fMRI  also  has  considerable 
expense  and  low  temporal  accuracy.  Even  though  the  activity  from  each  slice  of 
brain  may  be  assessed  in  an  fMRI  machine  within  about  100  or  so  milliseconds  the 
blood  flow  related  to  the  underlying  brain  activity  still  requires  about  7  seconds 
to  reach  its  peak.  There  has  been  a  recent  suggestion  that  hypo-activity  from  the 
deoxygenated  blood  will  ‘peak’  after  only  about  1  or  2  seconds  of  input  [1],  but 
that  is  still  to  be  confirmed.  Finally  PET  also  has  the  same  problems  of  slow  blood 
flow  possessed  by  fMRI,  but  good  spatial  accuracy.  Combining  two  or  more  of  the 
above  techniques  together  makes  it  possible  to  follow  the  temporal  development  of 
spatially  well  defined  activity  [2].  The  results  stemming  from  such  a  combination 
are  already  showing  clear  trends. 

The most important results being discovered by these techniques are that

1.  activity  of  the  brain  for  solving  a  given  task  is  localised  mainly  in  a  distinct 
network  of  modules, 

2.  the  temporal  flow  of  activity  between  the  modules  is  itself  very  specific, 

3.  the  network  used  for  a  given  task  is  different,  in  general,  from  that  used  for 
another  task. 

It is not proposed to give here a more detailed description of the enormous number
of results now being obtained by these techniques; the reader is referred, for example,
to the recent book [2], or the excellent discussion in [3]. However it is important
finally to note that the fact that there are distinctive signatures that can be picked
up by these instruments indicates that aggregates of nerve cells are involved in
solving the relevant tasks by the subjects. Thus population activity is involved,
and “grandmother” cells are not in evidence. Of course if they were important they
would slip through the net. However that there are such characteristic signatures
seems to show that the net is sharp enough to detect relevant aggregated activity.

3  Inverse  Problems 

There  has  been  much  work  done  to  decipher  the  detailed  form  of  the  underlying 
neural  activity  that  gives  rise  to  a  particular  MEG  or  EEG  pattern,  the  so-called 
“inverse  problem”.  Various  approaches  have  been  used  to  read  off  the  relevant
neural  processing,  mainly  in  terms  of  the  distribution  of  the  underlying  sources 
of  current  flow.  This  latter  might  be  chosen  to  be  extremely  discrete,  as  in  the 
case  of  the  single  dipole  fits  to  some  of  the  data.  However  more  complete  current 
flow  analyses  have  been  performed  recently,  as  in  the  case  of  MFT  [4].  This  leads 
to  a  continuous  distribution  of  current  over  the  brain,  in  terms  of  the  lead  fields 
describing  the  effectiveness  of  a  given  sensor  to  detect  current  at  the  position  of  the 
field  in  question. 
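
As a rough illustration of how lead fields enter a distributed-source estimate, the following sketch solves a toy linear inverse problem with a minimum-norm estimate. This is a standard textbook technique used here as a stand-in; it is not the MFT algorithm of [4]. The two-sensor, four-source lead-field matrix and the measurement values are invented purely for illustration.

```python
def forward(lead, current):
    """Sensor readings predicted from source currents: m = L j."""
    return [sum(l * j for l, j in zip(row, current)) for row in lead]


def min_norm_estimate(lead, measured, reg=0.0):
    """Minimum-norm estimate for two sensors: j = L^T (L L^T + reg*I)^-1 m.

    `lead` holds one row of lead-field values per sensor (hypothetical data).
    """
    # 2x2 Gram matrix of the lead-field rows
    g00 = sum(a * a for a in lead[0]) + reg
    g11 = sum(b * b for b in lead[1]) + reg
    g01 = sum(a * b for a, b in zip(lead[0], lead[1]))
    det = g00 * g11 - g01 * g01
    # Coefficients of each lead field in the estimated current distribution
    c0 = (g11 * measured[0] - g01 * measured[1]) / det
    c1 = (g00 * measured[1] - g01 * measured[0]) / det
    return [c0 * a + c1 * b for a, b in zip(lead[0], lead[1])]


# Hypothetical lead fields: sensitivity of 2 sensors to 4 source locations
lead_fields = [[1.0, 0.5, 0.0, 0.2],
               [0.0, 0.3, 1.0, 0.4]]
measurements = [2.0, -1.0]

estimate = min_norm_estimate(lead_fields, measurements)
predicted = forward(lead_fields, estimate)
```

The estimate is a weighted sum of the lead fields themselves, which is why it is spread continuously over all source locations rather than concentrated in a single dipole. With four unknowns and two measurements many other current distributions fit the data equally well; the minimum-norm choice is only one way of selecting among them.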

It has been possible to extend this approach to use cross entropy minimisation so as
to determine the statistics of the current density distribution in terms of the
covariance matrix of the measurements and the sensor mean results [5]. This approach is
presently being extended to attempt to incorporate the iteration procedure in the
MFT method of [4] so as to make the approach more effective [6].

There  are  further  deep  difficulties  with  the  inverse  problem.  Thus  one  has  to  face 
up  to  the  question  of  single  trials  versus  averaged  values.  Averaging  has  especially 
been  performed  on  the  EEG  data  so  as  to  remove  the  noise  present.  However  in  the 
case  of  successful  task  performance  the  subject  must  have  used  activity  of  a  set  of 
modules,  in  a  single  trial,  in  a  way  which  avoided  the  noise.  This  is  an  important 
question  that  will  be  taken  up  in  the  next  section:  how  is  successful  performance 
achieved  in  an  apparently  noisy  or  chaotic  dynamical  system  such  as  the  brain  [7]? 
Averaging simply removes the crucial data indicating how the solution to this
task might be achieved. But single trials appear to be very noisy. How do
modules  share  in  the  processing  so  as  to  overcome  the  variability  of  neural  activity 
in  each  of  them? 
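
The trade-off between single trials and averages can be made concrete with a small simulation. The sketch below assumes, purely for illustration, that each trial is a fixed sinusoidal waveform plus independent Gaussian noise; the waveform, noise level and trial count are invented and are not taken from any EEG data.

```python
import math
import random


def clean_waveform(k, n_samples):
    """Hypothetical noise-free evoked response at sample k."""
    return math.sin(2.0 * math.pi * k / n_samples)


def simulate_trial(rng, n_samples, noise_sd):
    """One trial: the fixed waveform plus independent Gaussian noise."""
    return [clean_waveform(k, n_samples) + rng.gauss(0.0, noise_sd)
            for k in range(n_samples)]


def rms_error(trace):
    """Root-mean-square deviation of a trace from the noise-free waveform."""
    n = len(trace)
    return math.sqrt(sum((v - clean_waveform(k, n)) ** 2
                         for k, v in enumerate(trace)) / n)


rng = random.Random(0)  # fixed seed so the run is repeatable
trials = [simulate_trial(rng, 200, 1.0) for _ in range(100)]
average = [sum(column) / len(trials) for column in zip(*trials)]

single_rms = rms_error(trials[0])
average_rms = rms_error(average)
```

Under these assumptions the noise in the average falls roughly as one over the square root of the number of trials, so the averaged trace lies much closer to the underlying waveform than any single trial does. The same operation, however, discards all trial-to-trial variation, which is exactly the information at issue in the text.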

One way to begin to answer this, and the earlier question posed by the inverse problem,
is to model the neural activity in a direct manner. In other words, a simulation of
the actual flow of nerve impulses, and of their resultant stimulation of other neural
modules, must be attempted. The resulting electric and magnetic fields of this neural
activity must then be calculated. This approach will allow more knowledge to be
inserted in the model being built than by use of methods like MFT, and so help
constrain the solution space better.

4  Neural  Modules  for  Perception 

It  is  possible  to  develop  a  program  of  modelling  for  the  whole  brain,  so  that  the 
spatial  and  temporal  flow  patterns  that  arise  from  given  inputs  can  be  simulated. 
That  also  requires  the  simulation  of  reasonably  accurate  input  sensors,  such  as  the 
retina,  the  nose  or  the  cochlea.  There  are  already  simple  neural  models  of  such 
input  processors  (such  as  in  [8]),  so  that  with  care  the  effects  of  early  processing 
could  already  be  included. 

The  order  of  work  then  to  be  performed  is  that  of  the  development  of  simple  neural 
modules,  say  each  with  about  a  thousand  neurons,  and  connected  up  to  others  in 
a  manner  already  being  determined  from  known  connectivities,  such  as  that  of  the 
macaque  [9].  An  important  question  of  homology  has  to  be  solved  here,  but  there  is 
increasing  understanding  of  how  to  relate  the  monkey  brain  to  that  of  the  human, 
so  that  problem  may  not  be  impossible.  There  is  also  the  difficulty  of  relating  the 
resulting  set  of,  say,  a  hundred  simplified  modules  to  their  correct  places  on  the 
cortical  surface  (and  to  appropriate  sub-cortical  places)  for  a  particular  human 
subject.  Even  assuming  the  cortical  surface  is  known  from  an  MRI  scan,  the  actual
placing  of  the  modules  corresponding  to  different  areas  might  be  difficult.  However 
this  may  be  tackled  by  considering  it  as  a  task  of  minimization  of  the  mismatch 
between  the  predictions  of  the  model  and  the  actual  values  of  EEG  and  MEG  signals 
during  a  range  of  input  processing,  and  including  sub-cortical  structures,  such  as 
discussed  in  [10]. 

5  Neuropsychological  Modelling 

There is a big gap between neurons in the brain and the concepts which they support.
The job of explaining how the latter arise from the former is formidable.
Indeed it is even harder to consider the ultimate of experiences supported, that of
consciousness itself. One approach has been to consider the brain as carrying out
simple computations. However there has recently been an attack on the notion of
the brain as a ‘computational’ organ, in the sense that “nervous systems represent
and they respond on the basis of the representations” [11]. It has been suggested,
instead, that brains do not perform such simple computations. Thus “It is shown
that simplified silicon nets can be thought of as computing, but biologically realistic
nets are non-computational. Rather than structure sensitive rules governed operations
on symbolic representations, there is an evolution of self-organising non-linear
dynamic systems in a process of ‘differing and deferring’” [7].

This  claim  of  non-computationality  of  the  brain  is  supported  by  appeals  to  the 
discovery  of  chaos  in  EEG  traces,  and  also  that  low-dimensional  motion  may  be 
involved  in  the  production  of  percepts  in  ambiguity  removal.  However  it  is  well 
known  that  neural  systems  are  non-linear,  and  that  such  systems  can  easily  have 
chaotic  motions  if  forced  into  certain  parameter  regimes.  Thus  the  dynamics  of  the 
whole  brain,  written  as  the  dynamical  system 

$$\frac{dX}{dt} = F(X, a)$$

in  terms  of  the  high  dimensional  state  vector  X,  is  expected  to  have  chaotic  motion 
for  some  of  the  parameter  range  a,  on  which  the  vector  field  F  depends.  Thus  X 
could denote the set of compartmental membrane potentials, and F denotes the
non-linear Hodgkin-Huxley equations for the production of the nerve impulses. However
there  are  still  expected  to  be  “representations”  brought  about  by  the  changes  of  the 
function  F  by,  for  example,  synaptic  modifications  caused  by  learning.  If  there  is  no 
change  in  the  transform  function  F  there  seems  to  be  no  chance  for  learning.  Hence 
the  claim  of  [7]  seems  to  be  incorrect  when  the  notion  of  representation  is  extended 
to  the  more  general  transform  function  F.  This  uses  the  notion  of  coupled  modules 
in  the  whole  brain,  as  are  observed  to  be  needed  by  the  non-invasive  techniques 
mentioned  earlier.  Thus  the  possibility  of  chaos  will  not  be  regarded  as  problematic 
to  the  program  being  proposed. 
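
The point that a non-linear system is chaotic only in some parameter regimes can be illustrated with a toy example. The sketch below uses the one-dimensional logistic map as a stand-in; it is a discrete-time map chosen for brevity, not the continuous-time Hodgkin-Huxley system of the text, and the parameter values 2.5 and 3.9 are illustrative choices. A positive Lyapunov exponent signals chaotic motion, a negative one signals convergence to regular behaviour.

```python
import math


def lyapunov_logistic(a, x0=0.3, n_transient=500, n_steps=5000):
    """Estimate the Lyapunov exponent of the map x -> a * x * (1 - x)."""
    x = x0
    # Discard the initial transient so the orbit settles onto its attractor
    for _ in range(n_transient):
        x = a * x * (1.0 - x)
    total = 0.0
    for _ in range(n_steps):
        slope = abs(a * (1.0 - 2.0 * x))      # local stretching factor
        total += math.log(max(slope, 1e-12))  # guard against log(0)
        x = a * x * (1.0 - x)
    return total / n_steps


regular = lyapunov_logistic(2.5)  # orbit settles on a stable fixed point
chaotic = lyapunov_logistic(3.9)  # orbit wanders on a chaotic attractor
```

The same function F, with only the parameter a changed, gives either regular or chaotic motion. This is the sense in which the appearance of chaos depends on the parameter regime rather than on the form of the equations.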

It  is  possible  to  build  neural  modules  to  model  some  of  the  important  processes 
observed  in  psychology.  Thus  the  various  sorts  of  memory  and  other  processes  may 
be  modelled  as:

■  working  memory  (as  a  set  of  neuron  buffers) 

■  semantic  memory  (as  an  associative  memory) 

■  episodic  memory  (as  a  recurrent  net) 
■  attention  (as  a  lateral  inhibitory  net)

■  action  (by  means  of  a  population  vector  encoding) 

This list is incomplete, but a start can be made on completing it. However in order
to  do  that  it  would  appear  necessary  to  consider  the  manner  in  which  psychological 
tasks  may  themselves  be  broken  down.  That  is  done  in  the  next  section. 

6  Component  Sub-Processes 

The method to be developed here is to try to decompose complex psychological
functions into a set of lower order component sub-processes (CSPs). An attempt
will be made to choose these so as to be independent of each other. That would
then make the identification of neural substrates for their performance easier, in
the sense that separate neural modules would be expected to support the different
CSPs. Lesion deficits would then be expected to show loss of different components
of the complex tasks, and hopefully even allow better identification of the CSPs
involved. Such a program has already been started for the frontal lobes in [12],
where an attempt has been made to decompose the supposed executive function of
that region into a set of CSPs in the case of attention.

The  conjecture,  then,  is  that 

Any complex psychological task is decomposable into a set of independent
component sub-processes, each of which is distributed across only a
few separate brain areas.

Such  a  conjecture  would  make  the  analysis  of  brain  function  a  problem  of  mapping 
the  corresponding  CSPs  onto  the  appropriate  brain  areas.  There  may  well  be  a 
limit  as  to  how  far  it  is  possible  to  break  down  a  complex  psychological  task  in  a 
manner  which  still  has  psychological  observability.  Thus  it  might  be  supposed  that 
some  small  brain  areas  may  give  a  contribution  to  a  task  which  does  not  have  any 
clearly  observable  effect  on  task  performance.  However  the  principle  of  parsimony 
would  lead  one  to  expect  that  such  areas  would  be  rather  unlikely,  since  they  do 
not  appear  to  possess  much  survival  value  to  the  system  and  so  could  be  dispensed 
with. 

It  is  possible  to  group  complex  tasks  under  the  four  main  headings: 

■  Perception 

■  Movement 

■  Memory 

■  Thought 

It  is  further  seen  that  perception  itself  has  at  least  the  two  component  sub-tasks  of 
construction  of  codes  (in  the  five  separate  modalities)  and  that  of  representations 
(such  as  at  feature,  object,  category,  position,  body  matrix,  lexical,  phonological 
and  word  level).  There  are  numerous  sub-divisions  of  memory  (episodic,  semantic, 
implicit,  priming,  working,  active)  with  considerable  overlap  with  representations 
(which are the objects of memory). There are functions which are strongly
associated with limbic regions, such as goals, values, drives and self representations.
There are also frontally-sited functions, as in attention, social behaviour, planning
and actions [13].

The  above  set  of  functions  is  performed  by  complex  brain  networks,  as  demonstrated 
clearly  in  [2].  It  is  a  very  important  but  difficult  task  to  attempt  to  reduce  this 
broad  range  to  a  smaller  set  of  underlying  elements,  the  CSPs  alluded  to  above. 
The  following  set  is  proposed  as  a  preliminary  version  of  the  required  list: 

■  Analyser  (into  features,  objects  or  categories) 

■  Predictor/Actor  (as  a  generator  of  schema) 

■  Monitor  (for  assessing  goodness  of  fit) 

■  Activator  (to  promote  or  inhibit  on-going  schema  or  other  forms  of  processing) 

■  Buffer  (for  holding  activity  over  a  fixed  or  variable  time) 

■  Clock  (for  estimating  temporal  duration) 

■  Long-term  memory  stores  (both  implicit  and  declarative) 

■  Goal  formation  (involving  value-memory) 

■  Drive  (arising  from  internal  bodily  sources,  and  being  independent  of  the  activator)

There may be other CSPs which must be added to the above list. Moreover the CSPs
of a given sort specified above may differ in their properties from one another.
Thus  there  are  buffers  which  hold  activity  for  a  fixed  time  (such  as  the  phonological 
store  [14])  and  those  which  can  preserve  it  over  a  variable  time  of  up  to  30  or  so 
seconds  (associated  specifically  with  frontal  lobe).  Similarly  activators  can  either 
excite  or  inhibit,  and  there  may  thus  be  two  quite  separate  sub-classes  of  modules 
performing  these  two  disparate  functions.  However  the  suggested  list  of  CSPs  given 
above  seems  to  be  of  enough  interest  to  explore  its  use  in  attempting  to  reduce 
the  complexity  of  psychological  tasks  to  more  manageable  proportions.  It  includes 
those  suggested  in  [12]  for  the  frontal  lobe,  where  the  ‘logic  CSP’  introduced  there 
is  assumed  to  be  the  schema  generator  above,  and  compression  of  separate  inhibitor 
and  excitor  modules  has  been  introduced  in  our  list  to  give  a  common  activator 
sub-set. 

The  next  step  is  to  develop  a  neural  underpinning  for  the  mental  processes  at  the 
level  of  the  CSPs  listed  above.  Thus  single  modules  or  small  coupled  nets  of  them 
must be constructed which (a) have a close relation to neural modules in the brain
from an anatomical and functional point of view, (b) appear to cause deficits in
processing when there is loss of the relevant module, and (c) are seen to be active in
the performance of a task as determined by some non-invasive measurement.

7  Neural  Modules  for  Component  Sub-Processes 

There  are  already  a  number  of  modules  which  may  be  suggested  to  solve  the  above 
problems  (a),  (b)  and  (c).  Thus 


Taylor:  Mathematics  of  an  Artificial  Brain 

(i) an analyser may be designed from a convolution filter, such as by the use of a
Gabor function kernel. Such a filter has similarity to the action of striate visual
cortex, and may be directly mapped onto the action of simple cells in area V1,
say. There is also a considerable level of understanding of the manner in which
such a filter may arise from adaptation to visual input.

(ii)  a  predictor  may  be  constructed  as  a  recurrent  net  of  the  form 

where such recurrence leads to the generation of a sequence stored in the nature
of the response function F, which may also be modulated by the external input
I. Identification of such a system is in terms of well defined recurrence of neural
output, as is well known to occur in the hippocampus, for example [15].

(iii)  an  activator  may  be  constructed  as  a  set-point  detector,  say  in  the  form 

$$x(t+1) = Y(-I(t) + s)$$

where  Y  is  the  Heaviside  step  function.  Then  neuron  x  is  active  (to  initiate 
search,  say)  if  the  input  I  (which  may  be  some  internal  energy  level,  say)  drops 
below  the  threshold  or  set-point  s.  This  can  be  mapped  onto  various  brain-stem 
systems. 

(iv) active memory, comparator action, etc.: there are several tasks associated with
actions which are not so simple to relate to actual neuroanatomical structures,
although there is now increasing work on this area of investigation. Some of
these are specifically associated with frontal functions. The approach to be
adopted here is to take a specific architecture, that of the ACTION network
[16], and determine in which manner it may be able to support the CSPs
associated with monitoring, prediction/action (schema generation) and active
memory.
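
The first three modules in the list above are simple enough to sketch directly. The code below is a minimal illustration under stated assumptions, not an implementation of any published model: the analyser is reduced to a one-dimensional Gabor filter; the predictor assumes the recurrence has the form x(t+1) = F(x(t), I(t)) with F a thresholded weighted sum; and the step function Y is taken to be 0 at the origin. All function names, weights and numerical values are hypothetical.

```python
import math


def gabor_kernel(sigma, freq, half_width):
    """1-D Gabor kernel: Gaussian envelope times a cosine carrier."""
    return [math.exp(-x * x / (2.0 * sigma ** 2)) * math.cos(2.0 * math.pi * freq * x)
            for x in range(-half_width, half_width + 1)]


def analyser_response(signal, kernel):
    """Peak absolute filter output as the kernel slides across the signal."""
    width = len(kernel)
    outputs = [sum(k * s for k, s in zip(kernel, signal[i:i + width]))
               for i in range(len(signal) - width + 1)]
    return max(abs(v) for v in outputs)


def heaviside(u):
    """Step function Y; its value at u == 0 is assumed to be 0."""
    return 1 if u > 0 else 0


def predictor_step(x, weights, external):
    """One recurrent update, assumed form x(t+1) = F(x(t), I(t))."""
    return [heaviside(sum(w * xj for w, xj in zip(row, x)) + inp - 0.5)
            for row, inp in zip(weights, external)]


def activator(level, set_point):
    """Set-point detector x(t+1) = Y(-I(t) + s): active when I falls below s."""
    return heaviside(-level + set_point)


# (i) analyser tuned to 0.1 cycles per sample
kernel = gabor_kernel(sigma=5.0, freq=0.1, half_width=20)
matched = analyser_response(
    [math.cos(2.0 * math.pi * 0.1 * t) for t in range(200)], kernel)
mismatched = analyser_response(
    [math.cos(2.0 * math.pi * 0.3 * t) for t in range(200)], kernel)

# (ii) predictor: weights[i][j] = 1 when pattern j is followed by pattern i
weights = [[0, 0, 1], [1, 0, 0], [0, 1, 0]]
state = [1, 0, 0]
sequence = [state]
for _ in range(3):
    state = predictor_step(state, weights, external=[0, 0, 0])
    sequence.append(state)

# (iii) activator: a falling internal energy level crosses the set-point 0.4
search_signal = [activator(level, set_point=0.4) for level in (0.9, 0.5, 0.3)]
```

The analyser responds strongly only to input near its tuned frequency, the predictor steps through the cycle of patterns stored in its weights, and the activator switches on once the input level drops below the set-point.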

The  ACTION  network  comprises  the  frontal  cortex,  basal  ganglia  and  associated 
thalamic  nuclei.  It  is  decomposable  at  least  into  the  four  main  loops  of  motor 
action,  frontal  eye  fields,  limbic  activity,  and  cognitive  activity  [17].  These  loops  are 
supposed  to  have  a  certain  degree  of  autonomy,  so  might  be  expected  to  play  an 
important  role  in  the  support  of  CSPs. 

The  architecture  of  the  ACTION  net  is  one  of  a  feedback  loop  between  cortex  and 
thalamus  which  has  threshold  modulation  by  a  non-reciprocal  disinhibitory  feed 
from  cortex  down  through  the  basal  ganglia  to  the  same  area  of  thalamus  as  was  in 
contact  with  the  cortex  in  the  first  place.  The  ACTION  network  is  built  to  capitalise 
on  this  structure  so  as  to  allow  for  the  development  of  a  flip-flop  action  between 
cortex  and  thalamus.  Thus  activity  arriving  at  a  cortical  region  from  posterior 
cortex  (parietal  or  temporal  areas  have  important  cortico-cortical  connections  with 
frontal  cortex)  may  cause  the  cortico-thalamic  loop  to  be  activated,  and  stay  on.
Such  persistence  may  crucially  require  support  from  the  basal  ganglia  disinhibition, 
which  may  be  achievable  by  means  of  cingulate  activation  (as  ‘drive’)  from  some 
set  point  detector,  or  from  some  goal  memory  set  up  there  or  on  another  portion 
of  the  ACTION  network. 

Such  functionality  may  thus  be  seen  to  support  active  memory.  By  the  addition 
of  lateral  inhibition  in  the  two  levels  of  the  basal  ganglia  it  may  also  be  shown  to 
function  as  a  comparator.  Again  the  successful  performance  by  ACTION  to  achieve
this  can  be  argued  for,  since  if  there  is  an  active  memory  being  preserved  on  cortex 
and  thalamus  then  a  new,  but  different,  one  arriving  on  cortex  will  produce  lateral 
inhibitory  activity  in  basal  ganglia.  A  competitive  action  is  then  fought  on  the  basal 
ganglia  between  the  earlier  activity  and  that  newly  arriving.  Assuming  that  there 
is  support  for  the  earlier  activity  from  cingulate,  say,  as  part  of  drive,  then  the 
former  activity  will  win  this  competition  on  the  basal  ganglia.  That  means  that  the 
new  activity  will  not  be  able  to  persist  on  cortex,  since  its  threshold  disinhibiting
mechanism  has  been  lost.  Only  new  activity  in  agreement  with  the  past  activity 
will  be  able  to  continue  its  activity  unchecked,  and  contribute  positively  to  the 
previously  stored  activity  on  cortex.  This  latter  may  then  exceed  some  detection 
threshold  to  indicate  that  a  search  has  been  completed  successfully,  say. 

Besides the obvious mathematical questions associated with the development and
storage capacity of the proposed cortico-thalamic flip-flop and of the lateral
competitive system on the basal ganglia, there is a more fundamental question that needs
answering. It is well established experimentally that coding of action in motor or
premotor cortex is in terms of a population code. Many cells in motor cortex
contribute to the determination of the direction an action will take. Each has its own
preferred direction, say $\alpha_i$, and resulting activity proportional to
$[1 + \cos(\theta - \alpha_i)]$, if $\theta$ is the direction of the upcoming movement.
However it does not seem easy to preserve this activity level by means of a flip-flop
memory, since in that case activity is either on or off, and is not graded. Therefore
there seems to be a contradiction between the proposed functioning of ACTION as a
flip-flop and the presence of a range of preserved activities by population coding.

One way to understand the mode of persistence of the motor cortex population
vector is to suppose that the motor cortex has an attractor behaviour arising from
its structure as a recurrent net due to the lateral connections it possesses. These
connections have been observed as having weights anticorrelated with the preferred
direction of movement vector for each cell; the value $w_{ij} = \cos(\alpha_i - \alpha_j)$
would roughly fit the data of [18]. In that case the value of the quantity
$\sum_j w_{ij}[1 + \cos(\theta - \alpha_j)]$, summed over the cells j, may be seen to be
proportional to $\cos(\theta - \alpha_i)$ (with constant of proportionality 1/3). In order
to obtain the additional term 1 in the cortical neuron activity it seems
necessary to use disinhibition from the basal ganglia side line. This can be achieved
by the use of summation of activity from a set of N cortical neurons onto a single
basal ganglia neuron (experimentally there is at least a ratio of a hundred cortical
neurons to one basal ganglia neuron) and thence to the thalamic cell; the weight of
the basal ganglia neurons need only be of order 1/N for preservation of the activity.

There now seem to be two possible sources of active memory in frontal cortex.
One is that developed in the previous paragraph, which uses a relaxation net of
recurrent connections which seems to fit the observed pattern of persistence in
motor cortex. The other is that described earlier, with a set of attractors
developed by thalamo-cortical feedback connections (functioning somewhat as a
bidirectional associative memory or BAM might). However the former model also
needs the activity of appropriately connected, and active, thalamic neurons to feed
the threshold disinhibition from thalamus up to the relevant cortical cells. Thus
there appears to be some sort of a reconciliation of these two proposed modes of
action of the cortico-thalamic-basal ganglia system. It is clear that further analysis,
and much simulation, needs to be performed in order to determine the effectiveness
of the ACTION network in this case.

8  Modules  for  the  Mind 

There are clearly many modules involved in the creation of consciousness. An
important question is whether there is a distributed action of the brain to achieve
conscious awareness, or whether there are discrete centres which are to be regarded
as the true sites of such experience, which is then shared around or broadcast to
other such sites so as to give a unified system of response. There is some support for the latter form
of  neural  system  for  consciousness,  since  inputs  are  known  to  be  processed  up  to  a 
semantic  level  without  reaching  awareness.  This  is  associated  with  the  phenomenon 
of  subliminal  processing.  It  is  now  reasonably  well  established  that  awareness  and 
response  to  a  later  word  or  other  input  may  be  modified  by  earlier  inputs  which 
have  not  gained  access  to  the  awareness  system  but  which  have  been  processed  up 
to  a  reasonably  high  level.  The  highest  level,  as  noted  above,  was  up  to  semantic 
memory,  but  only  for  words  and  not  for  sentences.  It  is  therefore  clear  that  there 
are  quite  high  level  modules  in  the  brain  whose  activation  does  not  directly  produce 
consciousness  but  which  are  important  in  the  influence  their  activity  may  exert  on 
later  conscious  experiences. 

Such  a  modulating  influence  has  been  modelled  [19]  by  means  of  a  coupled  set 
of  semantic/working  memory  modules.  The  first  of  these  acts  in  a  feedforward
excitatory  manner  with  dedicated  nodes  (which  might  arise  from  some  form  of 
topographic  map)  which  feed  to  similar  nodes  on  a  working  memory  module  with 
longer  time  constants.  This  latter  net  acts  as  a  buffer  so  as  to  hold  activity  for 
a  second  or  so,  in  order  that  earlier  context  can  be  properly  incorporated  in  any 
decision  that  is  taken  on  the  working  memory  module.  This  latter  has  inhibitory 
connections  between  incompatible  nodes,  so  that  their  activity  may  affect  the  time
to  reach  a  certain  threshold  of  nodes  activated  by  later  inputs  from  the  semantic  net. 
It  is  possible  in  this  manner  to  model  with  some  accuracy  the  changes  in  reaction 
time  in  lexical  decision  tasks  brought  about  by  subliminally  processed  letter  strings 
at  an  earlier  time. 
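
The mechanism by which an earlier input alters reaction time can be sketched with a pair of leaky working memory nodes. The code below is a toy illustration only, not the model of [19]: the update rule, the parameter values and the function name are all assumptions. A prime leaves residual activity either on the target node (compatible prime) or on a competing node that inhibits the target (incompatible prime), and reaction time is read as the number of steps the target needs to reach threshold.

```python
def time_to_threshold(target0=0.0, competitor0=0.0, drive=0.1, decay=0.9,
                      inhibition=0.2, threshold=0.8, max_steps=1000):
    """Steps needed for the target node to reach threshold under constant drive.

    target0 and competitor0 are the activities left behind by an earlier prime.
    Returns None if the threshold is not reached within max_steps.
    """
    target, competitor = target0, competitor0
    for step in range(1, max_steps + 1):
        # Leaky integration of the input, minus inhibition from the competitor
        new_target = decay * target + drive - inhibition * competitor
        competitor = decay * competitor   # the competitor receives no new input
        target = max(0.0, new_target)     # activity cannot become negative
        if target >= threshold:
            return step
    return None


rt_neutral = time_to_threshold()
rt_compatible = time_to_threshold(target0=0.3)
rt_incompatible = time_to_threshold(competitor0=0.3)
```

With these illustrative settings a compatible prime shortens the time to threshold and an incompatible prime lengthens it, which is the qualitative pattern of facilitation and interference described in the text.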

Consciousness  seems  to  reside,  therefore,  on  the  working  memory  modules  of  the 
posterior  cortex.  These  are  present  for  a  variety  of  codes,  and  allow  a  detailed 
implementation  of  the  Relational  Mind  model  developed  by  the  author  over  20  years 
ago,  and  published  more  recently  in  a  more  developed  form  [20].  However  there  are 
various  features  of  consciousness  that  are  missing  from  this  model,  in  particular 
the  emergence  of  its  uniqueness.  Such  a  feature  might  be  expected  to  arise  from 
some  form  of  competitive  action  between  the  different  working  memory  modules. 
Such  an  action  might  be  able  to  occur  in  cortex  by  suitable  connection  of  excitatory 
outputs  from  a  given  working  memory  module  to  others.  However  there  is  no  clear 
picture  of  such  inhibition,  especially  between  the  working  memory  modules  which 
appear  to  be  at  about  the  same  level  in  the  hierarchy  of  brain  modules  following 
cytoarchitectonic  and  myeloarchitectonic  reasoning;  these  working  memory  sites 
may  be  classified  as  heteromodal  cortex. 

In  order  to  resolve  this  issue  of  the  putative  source  of  long  range  inhibition  in  cortex 
(which  is  also  seen  to  be  absent  in  the  modelling  of  global  EEG  patterns)  it  may 
be  that  there  are  sub-cortical  sites  of  such  inhibition  which  are  of  importance.  It 
has been suggested that there are indeed such sites, in particular in the nucleus
reticularis thalami (NRT for short), which is a sheet of totally inhibitory neurons
which surrounds the upper and lateral parts of the thalamic nuclei. NRT cells are
activated by thalamo-cortical axon collaterals piercing the NRT, as also by
cortico-thalamic axon collaterals. A model of this system has been constructed [21] which
has been shown to lead to global competition, and even to allow explanation of the
relatively long time to reach consciousness determined by the beautiful experiments
of Libet and his colleagues [22].

The picture that emerges from these analyses is of a set of coupled semantic (SM)/
working (WM) memories, each for a given code A. There is a competition between
the activities of the WMs, the winner being that whose content emerges into
consciousness. However this account only deals with posterior sites, so is more correctly
denoted $C_p$, the subscript denoting both posterior and passive. The other important
component of consciousness will be denoted as $C_a$, the subscript now denoting
anterior and active. It is that aspect of awareness which the ACTION network,
introduced above, can also begin to tackle in terms of the various control operations
it can perform. Various aspects of this have been discussed earlier, as part of the
consideration of active memory and the comparator action it may support.

9  Conclusions 

From  what  has  been  outlined  briefly  above  the  time  seems  opportune  to  mount  a 
program  attempting  to  simulate  the  brain  on  a  large  scale  by  modelling  it  as  a  set 
of  connected  modules,  each  composed  of  simple  neurons  connected  together  by  a 
set  of  well  defined  rules.  The  program  can  be  broken  down  into  separate  tasks: 

■  construct  a  software  simulation  with  about  100  modules,  each  with  ever  more 
complex  neurons, 

■  include increasingly detailed neuroanatomical analyses so as to increase
realism,

■  develop  the  simulation  so  as  to  allow  for  a  modular  analysis  of  psychological 
functions  and  their  component  sub-processes, 

■  test  the  simulation  results  against  non-invasive  instrument  measurements, 

■  perform  mathematical  analyses  of  the  resulting  equations  using  techniques  from 
dynamical  systems  and  statistical  mechanics. 

The  resulting  set  of  models  will  be  able  to  provide  an  increasingly  detailed  basis  to 
explore  the  manner  in  which  the  brain  can  support  the  mind. 

Reggia [10] explored connectionist models which employed “virtual” lateral inhibition, and
included the activation of the receiving node in the equations for the flow of activation. Ahuja [1]
extended these concepts to include summing the total excitatory and inhibitory flow into a node.
He thus introduced the concept that the change of activation of a node depended on the integral
of the flow into that node and not just the present activation levels of the nodes to which it
is connected. Both Reggia’s and Ahuja’s models used probability data for the weights. Ahuja’s
model was further extended by Alexander [2], [3], [4] in the RX model to allow both the weights
and the activations of Ahuja’s model to be negative, and further, Alexander’s model included
the prior probabilities of all nodes. Section 1 of this paper contains a complete listing of the RX
equations and describes their development. The main result of this paper, the demonstration of
the convergence of the system, is presented in Section 2. Section 3 briefly describes the experiments
testing the RX system and summarizes this article.

1  The  RX  Equations 

The net is two-layered, with the lower level being the $J$ input nodes and the upper
level the $N$ output nodes. The values of the upper level nodes, $a_i(t)$, are on $[0, 1]$.
The prior probability of the existence of the feature associated with the $i$th node
is called $\bar{a}_i$. Values of $a_i(t)$ greater than $\bar{a}_i$ indicate a higher than average chance
of occurrence of the feature represented by node $i$, and lower values indicate a less
than average chance. The activation on the lower level nodes is computed from the
range of possible input values to a node. Let the smallest observed value be $Min_j$,
the largest $Max_j$, the average $Ave_j$, and let $Observ_j$ be the observed value. Then
define:

$$m_j = \frac{Observ_j - Ave_j}{Max_j - Ave_j} \quad \text{if } Observ_j > Ave_j, \text{ and}$$

$$m_j = \frac{Observ_j - Ave_j}{Ave_j - Min_j} \quad \text{otherwise.} \qquad (1)$$

Clearly $m_j$ lies on $[-1, 1]$. When $m_j$ assumes the value 1, the feature represented
by $m_j$ is present with probability one. When $m_j$ is $-1$ the feature is absent
with probability one. A value of zero indicates that no information exists concerning
the absence or presence of this feature. Two sets of weights exist. The first is
called $w_{ij}$, and in absolute value indicates the probability of occurrence (if positive,
and non-occurrence if negative) of the feature associated with the upper level node
($i$th), given the existence of activation on the lower level node ($j$th). The second
weight is called $v_{ij}$, and in absolute value indicates the probability of occurrence (if
positive, and non-occurrence if negative) of the feature associated with the lower
level node, given the existence of activation on the upper level node.
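The input scaling of equation (1) can be illustrated with a minimal Python sketch. The function name `normalize_input` and its argument names are hypothetical, chosen here only for illustration; the sketch assumes that the minimum, average and maximum satisfy min < average < max.

```python
def normalize_input(observ, min_j, ave_j, max_j):
    """Map an observed input value onto [-1, 1], following equation (1).

    Values above the average are scaled by the distance to the maximum,
    values at or below the average by the distance to the minimum.
    """
    if not (min_j < ave_j < max_j):
        raise ValueError("expected min_j < ave_j < max_j")
    if observ > ave_j:
        return (observ - ave_j) / (max_j - ave_j)
    return (observ - ave_j) / (ave_j - min_j)


if __name__ == "__main__":
    # Illustrative range only: minimum 2, average 5, maximum 11.
    for value in (2.0, 5.0, 8.0, 11.0):
        print(value, normalize_input(value, 2.0, 5.0, 11.0))
```

Because the two branches use different denominators, an asymmetric input range is still mapped so that the average lands on zero and the extremes land on -1 and 1.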

Two auxiliary functions are associated with each $a_i(t)$. The function which
conveys the excitatory activation is called $Exc_i(t)$, and the one which conveys the
inhibitory activation is called $Inh_i(t)$. Each is the sum of all the excitatory or inhibitory
flow of activation into node $i$. These functions are defined by their derivatives and,
since the latter are sums of everywhere positive terms, the former are monotone
increasing. Each of the terms used in defining the derivative is a function of all the
$a_i(t)$ and is the variable forcing term. There is one such term for each lower
level ($j$) node, hence a sum must be taken over all the $j$ nodes connected to any
given $i$ node. The bounded characteristic is achieved by including in the equation
a factor of the form $[N - Exc_i(t)]$. The choice of $N$ as the bound which $Exc_i(t)$
approaches is somewhat arbitrary.

2  Convergence  of  the  RX  Equations 
2.1  The  RX  Equations 


$$\dot{a}_i(t) = K_3 \cdot c_u \cdot [1 - a_i(t)] \cdot [Exc_i(t) - Inh_i(t)] \qquad (2)$$

$$\dot{a}_i(t) = K_3 \cdot c_l \cdot [a_i(t)] \cdot [Exc_i(t) - Inh_i(t)] \qquad (3)$$

according as $a_i(t) > \bar{a}_i$ or $a_i(t) < \bar{a}_i$ respectively. $K_3$ is a constant of proportionality,
and the values of $c_l$ and $c_u$ are chosen to keep the derivative continuous at
$\bar{a}_i$:

$$c_u = 1/(1 - \bar{a}_i), \qquad c_l = 1/\bar{a}_i.$$
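The two branches of the activation equation and the role of the constants $c_u$ and $c_l$ can be checked numerically. The following is a minimal sketch, not the authors' implementation; the names `activation_rate`, `a_prior` (standing for the prior probability of the node) and `k3` are illustrative assumptions.

```python
def activation_rate(a, a_prior, exc, inh, k3=1.0):
    """Right-hand side of equations (2)-(3) for a single output node.

    a       : current activation, expected in [0, 1]
    a_prior : prior probability of the node, strictly between 0 and 1
    exc/inh : current excitatory and inhibitory sums
    """
    if not 0.0 < a_prior < 1.0:
        raise ValueError("a_prior must lie strictly between 0 and 1")
    net_flow = exc - inh
    if a > a_prior:
        c_u = 1.0 / (1.0 - a_prior)          # upper branch, equation (2)
        return k3 * c_u * (1.0 - a) * net_flow
    c_l = 1.0 / a_prior                      # lower branch, equation (3)
    return k3 * c_l * a * net_flow


if __name__ == "__main__":
    prior = 0.3
    below = activation_rate(prior, prior, exc=2.0, inh=0.5)
    above = activation_rate(prior + 1e-12, prior, exc=2.0, inh=0.5)
    print(below, above)  # the two branches agree at the prior
```

With these constants both branches reduce to $K_3 [Exc_i - Inh_i]$ at the prior, and the rate vanishes at $a_i = 0$ on the lower branch and at $a_i = 1$ on the upper branch.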


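$$\dot{Exc}_i(t) = K_1 \cdot \bar{a}_i \cdot [N - Exc_i(t)] \cdot \sum_j \left[ k_{11} \cdot OUTPP_{ij}(t) + k_{12} \cdot outpm_{ij}(t) + k_{13} \cdot outmp_{ij}(t) + k_{14} \cdot OUTMM_{ij}(t) \right] \qquad (4)$$

$$\dot{Inh}_i(t) = K_2 \cdot \bar{a}_i \cdot [N - Inh_i(t)] \cdot \sum_j \left[ k_{21} \cdot outpp_{ij}(t) + k_{22} \cdot OUTPM_{ij}(t) + k_{23} \cdot OUTMP_{ij}(t) + k_{24} \cdot outmm_{ij}(t) \right] \qquad (5)$$

Here $k_{11}, \ldots, k_{24}$ are included for generality.

$$OUTPP_{ij}(t) = \frac{w_{ij} \cdot |a_i(t) - \bar{a}_i|}{\sum_l w_{lj} \cdot |a_l(t) - \bar{a}_l| + \epsilon_j} \cdot m_j \qquad (6)$$

with similar expressions for the other OUTP terms and

$$outpp_{ij}(t) = \frac{\sum_{k \ne i} v_{kj} \cdot |a_k(t) - \bar{a}_k|}{\sum_l \cdots + \epsilon_j} \cdot m_j \cdot |a_i(t) - \bar{a}_i| \qquad (7)$$

with similar expressions for the other outp terms. The $\epsilon_j$ is included so that the
denominator never vanishes.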


2.2  The  Dynamics  of  the  RX  Net 

The principal result of this section, that the dynamics of the RX system converges,
is given in Section 2.3 below. Hirsch [8] defines convergent dynamics by stating
that "the trajectory of every initial condition tends to some equilibrium". Prior to
stating the theorem, we make some observations about the critical points.

They are labeled as follows:

$$CP_0 = (r_1, r_2, \ldots, r_i, \ldots, r_N,\; N_1, \ldots, N_N,\; N_1, \ldots, N_N) \qquad (8)$$

and

$$CP_L = (r_1, r_2, \ldots, r_{N-L},\; N_1, \ldots, N_{N-L},\; d_{N-L+1}, \ldots, d_N,\; N_1, \ldots, N_{N-L},\; d_{N-L+1}, \ldots, d_N) \qquad (9)$$


Alexander  &  Coughlin:  Probability  Data  in  Connectionist  Models  63 


where $r_i$ (the value of $a_i$) $\in (0, 1)$, and the $d_i$ (the value of $Exc_i$ or $Inh_i$) $\in [0, N)$.
By letting $L$ (the number of the $a_i(t)$ that approach $\bar{a}_i$) range from 0 through $N$, all
critical points may be described simply by stating the value of $L$ and that of each
$d_i$. Note that the critical points need NOT be isolated. One of the lemmas in [2]
has demonstrated that $Exc_i(t)$ and $Inh_i(t)$ are monotone increasing functions. As
a result, points of type $CP_L$ may be thought of as "on the way" to $CP_0$. If $Exc_i(t)$
and $Inh_i(t)$ approach $N$, then the critical point becomes of type $CP_0$.
The norm adopted in the sequel is the $\ell_1$ norm [6], where $\|X\| = \sum_i |x_i|$. We proved
[2] that both $Exc_i(t)$ and $Inh_i(t)$ are bounded monotone increasing functions and
that the RX functions are Lipschitzian throughout their domain of definition.

We now discuss the details of the interface of the upper and the lower branches
of the equations. Recall that the upper branch is defined for $a_i \in [\bar{a}_i, 1]$, and the lower
branch for $a_i \in [0, \bar{a}_i]$. Since each branch is analytic on its domain of definition, and
the composite function (upper and lower branches considered as one) is Lipschitzian
where the branches meet (at $a_i = \bar{a}_i$), the solutions through any point exist and
are unique. To examine stability, it is necessary to look at the derivatives of the RX
function, and we are confronted with the problem that the derivative, per se, does not
exist on the hyperplane $a_i = \bar{a}_i$. However, the left-hand and right-hand derivatives
do exist, and for the purpose of calculating stability (in our case convergence) of
the solution, we contend that the above conditions are sufficient, since convergence
is concerned with the behavior of the solutions passing near the critical points.

2.3 Demonstration of the Convergence of the $a_i(t)$

Theorem 1 Consider the equations (2) through (9). Given:

$$0 < Exc_i(t_0),\; Inh_i(t_0) < N \quad \text{and} \quad 0 < a_i(t_0) < 1; \qquad -1 \le m_j \le 1, \text{ some } m_j \ne 0. \qquad (10)$$

Then, under the above initial conditions, all trajectories of the RX system tend to
some equilibrium point.

Proof From the definition, $Exc_i(t)$ and $Inh_i(t)$ can be shown to be bounded monotone
functions [2], and therefore approach limits, $LE_i$ and $LI_i$, respectively (see
Buck [7], pg 26). The cases that may occur are:

(1) $LE_i \ne LI_i$;

(2a) $LE_i = LI_i = d_i$, ($d_i < N$) and;

(2b) $LE_i = LI_i = N$.

Case (1) By examining separately the cases where $LE_i > LI_i$ and $LI_i > LE_i$, it is
easily shown that $a_i(t)$ approaches a limit in either case. (See [2].)

The remaining two cases are (2a), in which $LE_i = LI_i = d_i < N$, and (2b), in which
$LE_i = LI_i = N$. For the space of any triple $a_i$, $Exc_i$, and $Inh_i$, a plane is formed
by $Exc_i = Inh_i$. This is pictured in Figure 1. The line formed by the intersection
of this plane with the plane $a_i = \bar{a}_i$ constitutes the $i$th component of a $CP_L$ type
critical point. Thus, this line forms a continuum of critical points.

The flow, in both Cases (2a) and (2b), is considered (1) outside a $2\eta$ band ($\eta$
positive and small) about the plane $a_i = \bar{a}_i$ and (2) inside this band. It is not
difficult to analyze what happens outside the band, but pure analysis proofs could
not be obtained inside the band. Hence, within the $2\eta$ band of the $\bar{a}_i$ plane, the
associated linear system was used to analyze the flow of activation.




Chapter  7 


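[Figure: the volume $V$ about the plane $a_i = \bar{a}_i$, bounded by the planes $a_i = \bar{a}_i + \eta$ and $a_i = \bar{a}_i - \eta$, with Region 2 above and Region 1 below the plane $a_i = \bar{a}_i$.]

Figure 1 Details of the volume V.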

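Case (2a) $LE_i = LI_i = d_i$, where $0 < d_i < N$.

(1) Flow outside the $2\eta$ band. Equations (8) furnished the result of formally integrating
equation (4) and equation (5). The results were:

$$Exc_i(t) = N - [N - Exc_i(t_0)] \cdot e^{-K_1 \bar{a}_i \int_{t_0}^{t} OUT_i(s)\,ds} \qquad (11)$$

$$Inh_i(t) = N - [N - Inh_i(t_0)] \cdot e^{-K_2 \bar{a}_i \int_{t_0}^{t} out_i(s)\,ds} \qquad (12)$$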
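The bounded, monotone behavior of the excitatory sum can be illustrated numerically. The sketch below assumes a scalar equation of the form $dE/dt = k\,(N - E)\,f(t)$ with a nonnegative forcing term $f$, where the single constant `k` lumps together the constant factors in front of the bracket; the function names and the choice $f(t) = e^{-t}$ are illustrative assumptions, not taken from the RX system itself. It compares forward-Euler integration with the exponential closed form $E(t) = N - (N - E_0)\,e^{-k \int f}$.

```python
import math


def integrate_exc(forcing, exc0, n_bound, k, t_end, dt=1e-3):
    """Forward-Euler integration of dE/dt = k * (N - E) * forcing(t)."""
    exc, t = exc0, 0.0
    for _ in range(int(round(t_end / dt))):
        exc += dt * k * (n_bound - exc) * forcing(t)
        t += dt
    return exc


def closed_form_exc(forcing_integral, exc0, n_bound, k):
    """E = N - (N - E0) * exp(-k * integral of the forcing term)."""
    return n_bound - (n_bound - exc0) * math.exp(-k * forcing_integral)


if __name__ == "__main__":
    t_end = 20.0
    numeric = integrate_exc(lambda t: math.exp(-t), 0.0, 5.0, 1.0, t_end)
    exact = closed_form_exc(1.0 - math.exp(-t_end), 0.0, 5.0, 1.0)
    print(numeric, exact)
```

In this example the forcing term has a finite integral, so the sum levels off strictly below the bound $N$; a forcing term whose integral diverges would instead drive it to $N$.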

Assume that after some time $T$, $a_i(t)$ remains bounded away from the $\bar{a}_i$ plane by
some arbitrarily small amount $\eta$. Therefore:

$$I_1 = \int_0^\infty OUT_i(t)\,dt \sim \int_0^\infty g_1\,|a_i - \bar{a}_i|\,dt > \int_0^\infty M_{min}\,\eta\,dt \qquad (13)$$

$$I_2 = \int_0^\infty out_i(t)\,dt = \int_0^\infty g_2\,|a_i - \bar{a}_i|\,dt > \int_0^\infty M_{min}\,\eta\,dt. \qquad (14)$$

In equation (9), $g_1$ is $w_{ij} / (\sum_l w_{lj}\,|a_l - \bar{a}_l| + \epsilon_j)$, and $g_2$ is
$(\sum_{k \ne i} v_{kj}\,|a_k - \bar{a}_k|) / (\sum_l \cdots + \epsilon_j)$.

Not all the $a_k(t)$ may approach their $\bar{a}_k$. Were this to occur, then $I_2$ would approach
a limit different from $I_1$. Therefore, $LE_i$ would be greater than $LI_i$ and,
contrary to the assumption ($LE_i = LI_i = d_i$), would not converge to $d_i$.

Under the assumption that $\sum v_{kj}\,|a_k(t) - \bar{a}_k| \ne 0$ we assert that $g_2 = O(g_1)$
and $g_1 = O(g_2)$ ($O$ stands for "order of"). That is, $g_1$ and $g_2$ are of the same
order of magnitude (Olmstead [9], p141), since they are both ratios of fractions
whose denominators are not zero. The integrals $I_1$ and $I_2$ will therefore diverge,
and $Exc_i(t)$ and $Inh_i(t)$ will approach $N$ and not $d_i$, contrary to assumption.

(2) The flow inside the $2\eta$ band. Elsewhere [2], the eigenvalues and eigenvectors
of the Jacobians on each side of the $a_i(t) = \bar{a}_i$ hyperplane are calculated, and
solutions of the associated linear system discussed in detail. It suffices to say that
on one side of the hyperplane $a_i(t) = \bar{a}_i$, the eigenvalues are imaginary, and cause
elliptical motion about the line $Exc_i(t) = Inh_i(t)$ (at $d_i$), which is in the hyperplane
$a_i(t) = \bar{a}_i$. In Figure 1 the imaginary eigenvalues from the linear system characterize
the motion above the $\bar{a}_i$ plane, and the real eigenvalues that below this plane. From
below the $\bar{a}_i$ plane, trajectories in the region where $Exc_i > Inh_i$ (called Region 1 in
Figure 1) approach the $\bar{a}_i$ plane from below, and penetrate it. Above the $\bar{a}_i$ plane
(Region 2, where $Exc_i > Inh_i$) trajectories travel an elliptic path about the line
$Exc_i = Inh_i$ and enter Region 4 (where $Inh_i > Exc_i$). Two situations may occur.
First, they may leave the $\eta$ band and go far enough above it that they may not
return. It is known that the integrals diverge in this region and hence $Exc_i(t)$ and
$Inh_i(t)$ do not converge to $d_i$. Second, the trajectories may move back towards the
hyperplane $a_i(t) = \bar{a}_i$. They penetrate this plane and pass through to the other side.
Once beneath the $\bar{a}_i$ plane they are in Region 3, where $a_i < \bar{a}_i$ and $Inh_i > Exc_i$.
The trajectory then moves away from the critical point. Since both $Exc_i$ and $Inh_i$
are monotone non-decreasing, once past $d_i$, they cannot approach this point again.
Therefore, unless trajectories actually encounter a $CP_L$ type critical point, they are
driven away from it and into the region where $|a_i - \bar{a}_i| > \eta$. The integrals $I_1$ and
$I_2$ therefore increase, and hence $Exc_i$ and $Inh_i$, being monotone increasing, will
exceed $d_i$ and not approach it as a limit.

Below the $\bar{a}_i$ plane, the eigenvalues are real, but one of them is positive. Bellman
[6] (Theorem 3, p88) has shown that, under conditions which the RX equations
obey (i.e. differentiability, Rudin [11], pg 189), critical points possessing positive
eigenvalues are unstable. As such, $CP_L$ type critical points are unstable equilibrium
points. Instability is not a strong property, and therefore the possibility of points
approaching a $CP_L$ type critical point cannot be dismissed. Elsewhere [2] it is
demonstrated that a solution of the linear system does indeed approach $\bar{a}_i$.

We now move on to the last remaining possibility for $Exc_i(t)$ and $Inh_i(t)$: $LE_i =
LI_i = N$ (Case (2b)).

(1) Flow outside the $2\eta$ band. In this case, as in the preceding case, it is fairly simple
to show that $a_i(t)$ approaches a limit.

(2) Flow inside the $2\eta$ band. Flow inside the band is approximated by the linear
system, but all trajectories are now those for which $Exc_i(t)$ and $Inh_i(t)$ are within
$\delta$ of $N$. The approximating linear equations are presented in [2]. As before, the
trajectories will eventually reach Region 3, where, although the constants multiplying
it are small, the motion is still governed by a positive exponential and $a_i(t)$ will be
directed away from $\bar{a}_i$. Thus, it would appear that when $a_i(t)$ leaves the cube of
Figure 1, it will approach a limit. □

We  conclude  by  proving  a  lemma. 

Lemma 2 If $\lim_{t \to \infty} Exc_i(t) = \lim_{t \to \infty} Inh_i(t) = N$, then $a_i(t)$ will not attain $\bar{a}_i$ in finite
time.

Proof The proof is by contradiction. Assume for $t > t^*$ that $a_i(t) = \bar{a}_i$. Then, for
$t > t^*$, $OUT_i(t) = out_i(t) = 0$. Therefore:

$$\int_0^\infty OUT_i(t)\,dt = \int_0^{t^*} OUT_i(t)\,dt = \alpha, \qquad \int_0^\infty out_i(t)\,dt = \int_0^{t^*} out_i(t)\,dt = \beta \qquad (15)$$

where $\alpha$ and $\beta$ are finite numbers. Integrating (4) we have for a solution to $Exc_i(t)$:

$$\lim_{t \to \infty} Exc_i(t) = N - [N - Exc_i(t_0)]\,e^{-K_1 \bar{a}_i \int_0^\infty OUT_i(t)\,dt} = N - [N - Exc_i(t_0)]\,e^{-K_1 \bar{a}_i \alpha} \ne N.$$

The same can be shown for $Inh_i(t)$. □

It would appear from this lemma, and from the fact that $a_i(t)$ is very near the $\bar{a}_i$
plane, that the trajectories within the volume being considered are moving very
slowly. This is to be expected since all terms are small.

We have thus shown in all three cases (1), (2a) and (2b) that all trajectories of
the RX system approach limits, and hence the demonstration of the convergence of
$a_i(t)$ is complete.

3  Summary 

The RX equations have been tested in both associative memory and control applications.
In associative memory, a back-propagating net's performance only
slightly exceeded that of the RX net in a test involving classifying RADAR signals.
In control applications a two-input, one-output-node RX net (three-neuron controller)
performed as well as or better than a fuzzy controller in the task of backing a
truck to a loading dock. Because of the apparent biological plausibility of the three-neuron
controller, and the existence of elementary life forms with control circuitry
yet no memory circuitry, we offer the speculation that neural memory circuitry has
evolved from neural control circuitry [5]. Recall that the RX net uses probability
data as weights and that it involves no learning. In the future, we intend to study
more biologically plausible versions of the RX net in both memory and control
applications.

This paper proposes a weighted mixture of locally generalizing models in an attempt to resolve the
trade-off between model mismatch and measurement noise given a sparse set of training samples, so
that the conditional mean estimation performance of the desired response can be made adequate
over the input region of interest. In this architecture, each expert model has its corresponding
variance model for estimating the expert's modeling performance. Parameters associated with
individual expert models are adapted in the usual least-mean-square sense, weighted by the variance
model output, whereas the variance models are adapted in such a way that expert models of
higher resolution (or greater modeling capability) are discouraged from contributing, except when the
local modeling performance becomes inadequate.

1  Background 

Artificial neural networks have been widely used in modeling and classification applications
because of their flexible nonlinear modeling capabilities over the input
region of interest [3]. The merit of individual networks is generally evaluated in
terms of the cost criterion and learning algorithm by which the free parameters are
adapted, and the characteristics of the parametric model. These networks often
employ minimum sum-squared error criteria for parameter estimation such that
the desired solution is a conditional mean estimator for the given data set, provided
that the model is sufficiently flexible [2]. The use of such a criterion not only
provides a simple iterative procedure for parameter estimation; the adaptation
also follows the Maximum Likelihood (ML) principle when the model uncertainty is
a realization of an independent Gaussian process. Nevertheless, in many applications
the generalization performance obtained with the least-squares (LS) criterion breaks down
when the conditional mean response is estimated using a small set of data samples
with an unknown degree of noise uncertainty. This is because the LS criterion maximizes
only the overall fitting performance defined by the data set, rather than
evaluating the model mismatch in any local input region. Thus, a sufficiently flexible
model often overfits undesirable noisy components in the data samples, whereas
a model with restricted degrees of freedom is less likely to converge to the conditional
mean of the data response [4]. This paper focuses on an alternative approach
to dealing with the bias/variance dilemma, which is based on a variation of the
"Mixture-of-Experts" (or ME) [5].

2  Weighted  Mixture  of  Experts  (WME) 

One alternative solution to the bias/variance dilemma is to incorporate a weighted
mixture of expert (or WME) models so that only a few experts contribute in any particular
input region [5]. The internal structure of each of these experts is fixed, and
a separate variance model is used, one for each expert, to evaluate the corresponding
expert's performance. If the expert and variance models are chosen to be linear with
respect to their adaptable parameters, the resulting learning process can be formulated
as a linear optimization problem, which enables rapid learning convergence.
The learning process for the WME algorithm is based on the competitive nature
among experts in modeling an unknown mapping, and the overall WME output,
$y(x)$, is simply a linear combination of individual expert model outputs weighted
by their variance model outputs, $a_j(x)$, at $x$:


$$y(x) = \sum_{j=1}^{m} p_j(x)\,y_j(x), \qquad p_j(x) = \frac{\exp(-a_j(x))}{\sum_{k=1}^{m} \exp(-a_k(x))} \qquad (1)$$


where $y_j(x)$ is the $j$th expert model output, and $a_j$ is the corresponding variance
model output (associated with the $j$th expert output). $p_j(x)$ can be interpreted as a
local fitness measure of the $j$th expert modeling performance at $x$, and is bounded
$\in [0, 1]$. The error criterion for individual expert models is defined as

$$E_y = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m} p_j(x_i)\,(y(x_i) - y_j(x_i))^2 \qquad (2)$$

whereas the cost criterion for adapting the variance models is

$$E_a = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m} \lambda_j\,p_j(x_i)\,(y(x_i) - y_j(x_i))^2 \qquad (3)$$


where $n$ is the number of data samples, and $y(x_i)$ is the data output for $x_i$. $E_y$
ensures that each of the experts is allowed to best approximate $y(x_i)$, weighted by
$p_j(x_i)$, whereas $E_a$ ensures that all experts are specialized in unique input regions
by assigning a smaller $p_j(x_i)$ for a larger error variance, and vice versa. The term
$\lambda_j$ in (3) regulates the variance estimation among models of varying resolution so
that $\lambda_j$ is larger for higher-resolution experts, and vice versa. If this term were not
included, the resulting adaptation would lead to a WME model consisting mainly
of high-resolution experts. $\lambda_j$ can thus be interpreted as a form of regularization
which takes in prior knowledge about the set of experts for modeling, and can be
set inversely proportional to their degrees of freedom (e.g. total number of free
parameters). This criterion is thus different from that proposed in [5] as a result
of this regularizing factor $\lambda$. In the case where the model structure of any expert
is non-uniform across the input region of interest (such as a clustering-based
radial-basis-function network), $\lambda_j$ can be modified in such a way that it varies from one
input segment to another.
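The following Python sketch shows how the mixture weights, the combined output and the per-sample cost terms might be computed. It rests on assumptions that should be read as such: the fitness weights are taken to be a normalized exponential, $p_j = \exp(-a_j) / \sum_k \exp(-a_k)$, and the per-sample costs are taken to be $\sum_j p_j (y - y_j)^2$ and $\sum_j \lambda_j p_j (y - y_j)^2$. The function names are hypothetical.

```python
import math


def fitness_weights(a):
    """p_j = exp(-a_j) / sum_k exp(-a_k) for variance model outputs a_j."""
    shift = min(a)  # subtracting a constant leaves the ratios unchanged
    expo = [math.exp(-(a_j - shift)) for a_j in a]
    total = sum(expo)
    return [e / total for e in expo]


def wme_output(expert_outputs, a):
    """Combined output: expert outputs weighted by their fitness weights."""
    p = fitness_weights(a)
    return sum(p_j * y_j for p_j, y_j in zip(p, expert_outputs))


def sample_costs(target, expert_outputs, a, lambdas):
    """Per-sample expert cost and lambda-weighted variance cost."""
    p = fitness_weights(a)
    sq = [(target - y_j) ** 2 for y_j in expert_outputs]
    e_y = sum(p_j * s for p_j, s in zip(p, sq))
    e_a = sum(l_j * p_j * s for l_j, p_j, s in zip(lambdas, p, sq))
    return e_y, e_a


if __name__ == "__main__":
    print(fitness_weights([0.0, 0.0]))            # equal variance outputs
    print(wme_output([1.0, 3.0], [0.0, 0.0]))
    print(sample_costs(2.0, [1.0, 3.0], [0.0, 0.0], [1.0, 3.0]))
```

With equal variance model outputs every expert receives the same weight, and a larger $\lambda_j$ makes the same squared error of expert $j$ count for more in the variance cost.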

As in many artificial neural networks, the iterative procedure can be modified so
that the adaptation is carried out sample by sample, and the corresponding cost criterion
can be reduced by dropping the first summation term in (2, 3). It is generally
desirable to maintain equal variance model outputs prior to on-line learning (e.g. $a_j(\cdot) = 0$) to avoid
bias toward any particular model structure, unless prior knowledge of the unknown
mapping is available. As training proceeds, the variance models
are adapted to form appropriate prior knowledge for combining the experts. These
parameterized models can be chosen from a variety of existing locally generalizing
models, such as radial basis functions, B-splines or Cerebellar Model Articulation
Controller (or CMAC) networks. These networks are particularly well suited to the
WME model implementation because their internal structures are fixed and their
parameters are defined over a local region of support, thus enabling rapid learning
convergence. Also, the influence of the variance model is restricted to within the
local extent of the training input, and equal contribution will be assigned to regions
of unseen data samples. If both variance and expert models generalize globally, the
conditional mean estimation performance is likely to deteriorate considerably as a
result of temporal instability.
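A sample-by-sample update of this kind might look as follows. This is only a sketch under explicit assumptions: each expert is linear in its parameters, the variance model output of each expert at the current input is treated as a directly adaptable scalar, the fitness weights are a normalized exponential of the negated variance outputs, and both updates are plain gradient steps on the per-sample costs $p_j (y - y_j)^2 / 2$ and $\sum_k \lambda_k p_k (y - y_k)^2$. The function names and the step sizes `mu` and `mu_a` are hypothetical.

```python
import math


def fitness_weights(a):
    """Normalized exponential of the negated variance model outputs."""
    shift = min(a)
    expo = [math.exp(-(v - shift)) for v in a]
    total = sum(expo)
    return [e / total for e in expo]


def online_step(theta, a, features, target, lambdas, mu=0.5, mu_a=0.5):
    """One sample-by-sample update of expert parameters and variance outputs.

    theta[j]    : parameter list of expert j (linear in its features)
    a[j]        : variance model output of expert j at this input
    features[j] : basis-function activations of expert j at this input
    """
    m = len(theta)
    p = fitness_weights(a)
    outputs = [sum(w * f for w, f in zip(theta[j], features[j])) for j in range(m)]
    errors = [target - y for y in outputs]
    # Expert update: gradient step on p_j * (target - y_j)^2 / 2.
    new_theta = [
        [w + mu * p[j] * errors[j] * f for w, f in zip(theta[j], features[j])]
        for j in range(m)
    ]
    # Variance update: gradient step on sum_k lambda_k * p_k * e_k^2 w.r.t. a_j.
    weighted_sq = [lambdas[j] * errors[j] ** 2 for j in range(m)]
    mixture_cost = sum(p[j] * weighted_sq[j] for j in range(m))
    new_a = [a[j] + mu_a * p[j] * (weighted_sq[j] - mixture_cost) for j in range(m)]
    return new_theta, new_a


if __name__ == "__main__":
    theta, a = online_step([[1.0], [0.0]], [0.0, 0.0], [[1.0], [1.0]],
                           target=1.0, lambdas=[1.0, 1.0])
    print(theta, a)
```

Under these assumptions an expert whose weighted squared error exceeds the mixture average has its variance output raised, which lowers its fitness weight at the next evaluation.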

3  Example 


Figure 1 (a) Top: The RMS curves as a function of training cycle number for five different models, averaged over 10 independent sets of 500 noiseless (randomly drawn) training samples: single CMAC (C = 2); single CMAC (C = 7); additive CMACs (C = 2, 7); 'o': WME CMACs (1:1 $\lambda$ ratio, C = 2, 7); 'x': WME CMACs (1:3 $\lambda$ ratio, C = 2, 7). (b) Bottom: same RMS curves as described in (a) except that the samples are contaminated with zero-mean Gaussian noise of variance 0.1.

Figure 2 (a) Similar configuration to that described in Figure 1 except that each independent training set contains 5000 samples. (b) Bottom: same RMS curves as described in (a) except that the samples are contaminated with zero-mean Gaussian noise of variance 0.1.

In this section, the generalization performance and implementation of the WME
model are illustrated with an example using a sigmoidal function as the desired surface:

$$y(x_1, x_2) = 1 + \frac{1}{1 + \exp[-100(x_1 + x_2)]} \qquad (4)$$
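The shape of this surface can be probed with a few lines of Python. The sketch below evaluates the sigmoid of equation (4) and estimates its slope along $x_1$ by a central difference; the function names and the step size `h` are illustrative choices.

```python
import math


def target_surface(x1, x2):
    """Desired surface of equation (4): a steep sigmoid across x1 + x2 = 0."""
    return 1.0 + 1.0 / (1.0 + math.exp(-100.0 * (x1 + x2)))


def slope_along_x1(x1, x2, h=1e-4):
    """Central-difference estimate of the partial derivative w.r.t. x1."""
    return (target_surface(x1 + h, x2) - target_surface(x1 - h, x2)) / (2.0 * h)


if __name__ == "__main__":
    print(target_surface(0.0, 0.0))   # on the ridge x1 + x2 = 0
    print(slope_along_x1(0.0, 0.0))   # steep on the ridge
    print(slope_along_x1(0.5, 0.5))   # essentially flat away from it
```

On the line $x_1 + x_2 = 0$ the analytic slope is $100 \cdot \tfrac{1}{4} = 25$, while a short distance away the surface saturates at 1 or 2 and the slope is negligible.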

This surface has a significant gradient along the ridge on the anti-diagonal but zero gradient
elsewhere, and thus the WME model is particularly well suited to the surface,
fitting high-resolution experts near the anti-diagonal and low-resolution experts to
regions with zero gradient. In this example, the WME model is based on two independent
expert models, each parameterized as a form of CMAC. As a form of
generalized look-up table, the CMAC output is a linear combination of a set of
submodels' outputs weighted by a set of adaptable parameters. Each of the submodels
is defined as a union of non-overlapping basis functions of hypercubic support
on the input lattice (with the support spanning $C$ cells along each input
axis), and in the standard CMAC these submodels are offset relative to one another
along the input diagonal, where the generalization power is dominant. As the number of
submodels is also constrained to be $C$, which is generally greater than 1, the entire
set of basis functions associated with all submodels is thus sparsely distributed on
the input lattice. The individual basis function for any input can be either binary
or continuous (tapered basis function) so that its response is proportional to some
chosen metric norm between the input and the support center. In addition, the
layout of the submodels ensures that exactly $C$ parameters are updated at every
iteration for a given training sample. One important property of the CMAC network
is that a larger $C$ leads to a larger degree of generalization but with fewer
adaptable parameters, and vice versa. Thus, the trade-off between the computational
cost and modeling capability must be balanced with great care. The CMAC
parameters are commonly adapted using the normalized LMS algorithm in order
to facilitate fast convergence. Further detail of the CMAC modeling capability and
learning convergence can be found in [3].
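The mechanics described above can be made concrete with a deliberately simplified sketch: a one-dimensional CMAC with binary basis functions, $C$ submodels each offset by one cell, and a normalized LMS update. It is not the two-dimensional, tapered and self-normalized configuration used in the example; the class name `BinaryCMAC1D` and its methods are hypothetical.

```python
class BinaryCMAC1D:
    """Minimal 1-D binary CMAC: C offset submodels, each tile C cells wide."""

    def __init__(self, n_cells, c):
        self.n_cells = n_cells
        self.c = c
        tiles = (n_cells + c - 2) // c + 1  # enough tiles for the largest offset
        self.weights = [[0.0] * tiles for _ in range(c)]

    def active_tiles(self, cell):
        """One (submodel, tile) pair per submodel for a quantized input cell."""
        return [(k, (cell + k) // self.c) for k in range(self.c)]

    def predict(self, cell):
        return sum(self.weights[k][t] for k, t in self.active_tiles(cell))

    def update(self, cell, target, rate=1.0):
        """Normalized LMS step: exactly C parameters change per sample."""
        error = target - self.predict(cell)
        for k, t in self.active_tiles(cell):
            self.weights[k][t] += rate * error / self.c
        return error


if __name__ == "__main__":
    net = BinaryCMAC1D(n_cells=20, c=7)
    net.update(5, 2.0)
    print(net.predict(5), net.predict(6), net.predict(12))
```

After one update with unit learning rate the trained cell reproduces its target, a neighboring cell that shares six of the seven active tiles receives six sevenths of it, and a cell seven intervals away shares no tile and is unaffected, which illustrates how $C$ sets the width of local generalization.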

In the sigmoidal example, the input domains were restricted to $[-1.5, 1.5]^2$ and
the submodel offsets were uniform along univariate axes (see [1] for the improved
offset scheme). The nonlinear input transformation function associated with each
of the parameters was chosen to be linearly tapered along each input axis, and the
function output was formed using the standard product operator. The individual
expert model output was self-normalized in order to produce a smoother response
by minimizing bias toward any particular input region. In both expert models,
each input axis was partitioned into 20 intervals, and the learning rate was initially
set to 1 (decreased slowly to zero). The generalization parameters, $C_1$ and $C_2$, for
the two expert models (identified as $y_1$ and $y_2$) were chosen to be 7 and 2. Thus,
the basis support of each cell was seven intervals wide in $y_1$, but only two in $y_2$,
and there were 186 and 481 adaptable parameters in $y_1$ and $y_2$ respectively. $\lambda_1$
and $\lambda_2$ were chosen to be 1 and 3 respectively because of the approximate ratio
of free parameters. The variance models, $a_1$ and $a_2$, were parameterized using the
same internal structure and learning rule as those of $y_1$ and $y_2$ respectively, but
with independent parameter vectors, and the model parameters were adapted using
the least-mean-squares method. In addition to the WME model (M1), four independent
models were considered as control comparisons: M2: single CMAC ($C = 2$); M3:
single CMAC ($C = 7$); M4: additive CMACs ($C = 2, 7$) where $p_1$ and $p_2$ were each
set to 0.5; and finally M5: WME but with $\lambda_1 = \lambda_2 = 1$. That is, M2 to M4 shared
identical learning rules and input partitioning with M1, except for the generalization
width, whereas M1 and M5 differed only in $\lambda_1$ and $\lambda_2$.

Case  1:  RMS  Study  for  Sparse  Training  Samples 

In this case study, 10 independent sets of 500 data samples $\{S_1, S_2, \ldots, S_{10}\}$ were
generated by uniformly randomly interrogating the sigmoidal function within the
hypercubic input region $[-1, 1]^2$. Each of the five models was independently
adapted using individual samples sequentially drawn from $S_1$, and the training was
carried out over ten sweeps of $S_1$. Parameters of these models were then nullified,
and the same run was repeated for the rest of the data sets. An expected
root-mean-square (RMS) generalization error curve as a function of training cycle was
then generated using 3000 independent testing samples for the five models by
averaging the individual RMS performance over the ten data sets. Figure 1a shows
the corresponding RMS curves for these models, and Figure 1b shows the similar
curves except that the same set of training samples was contaminated with additive
Gaussian noise of variance 0.1. It is interesting to observe that M1 and M5
performed significantly better than the single and additive CMACs (of fixed $p_{1,2}$),
indicating the importance of estimating the variance from the data samples. Also,
M1 provides a better RMS fit than M5, suggesting that $\lambda$
can be tuned to provide optimal resource sharing. Similar RMS curves for the first
500 training samples averaged over the ten runs also indicated similar profiles to
those in Figure 1, suggesting that the WME model consistently performed
better than the other four models for on-line learning. More work is needed
on the choice of $\lambda$.
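The data generation and error measure of this case study can be sketched in Python. The sketch assumes the sigmoidal surface $y = 1 + 1/(1 + \exp[-100(x_1 + x_2)])$, uniform sampling on $[-1, 1]^2$ and optional zero-mean Gaussian noise of a given variance; the function names and the use of a seeded generator are illustrative assumptions, and no model training is included.

```python
import math
import random


def target_surface(x1, x2):
    """Sigmoidal test surface with a steep ridge along x1 + x2 = 0."""
    return 1.0 + 1.0 / (1.0 + math.exp(-100.0 * (x1 + x2)))


def make_samples(n_samples, noise_variance=0.0, seed=0):
    """Draw inputs uniformly from [-1, 1]^2 and attach (optionally noisy) outputs."""
    rng = random.Random(seed)
    sigma = math.sqrt(noise_variance)
    samples = []
    for _ in range(n_samples):
        x1, x2 = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
        noise = rng.gauss(0.0, sigma) if sigma > 0.0 else 0.0
        samples.append(((x1, x2), target_surface(x1, x2) + noise))
    return samples


def rms_error(model, samples):
    """Root-mean-square difference between model(x1, x2) and the sample outputs."""
    squared = [(model(*x) - y) ** 2 for x, y in samples]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    train = make_samples(500, noise_variance=0.1, seed=1)
    test = make_samples(3000, seed=2)
    print(len(train), rms_error(target_surface, test))
    print(rms_error(target_surface, make_samples(3000, 0.1, seed=3)))
```

Measured against noisy outputs, even the true surface shows an RMS error close to the noise standard deviation, $\sqrt{0.1} \approx 0.32$, which is why a noiseless test set gives the more informative generalization curve.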

Case  2:  RMS  Study  for  Abundant  Training  Samples 
In this case study, 10 independent sets of 5000 data samples were instead generated
using the identical method described above, and the same procedure was carried
out to form the expected RMS curves for the five models. Figure 2a shows the
corresponding curves using noiseless samples whereas Figure 2b shows the similar
curves using noisy samples. It can be observed that, except for M3, which had almost
the same RMS characteristics as those shown in case 1, the modeling performances
among the four models were relatively closer, suggesting that for more abundant
training samples and sufficient modeling capability, the process for estimating the
variance becomes less critical, as expected. Note that for all the RMS curves, noise
overfitting was not observed because of the diminishing learning rate for
adaptation, which can be interpreted as a form of regularization and is conceptually
similar to the early-stop training approach [6].

Case  3:  Surface  Evaluation 

To visualize the modeling performance, Figure 3 shows the individual model outputs evaluated at every point of a 50 × 50 grid after 10 training cycles (1 run) using a single set of 500 noisy training samples. The rough surface of M2 was due to its smaller local support (less noise averaging) and the sparse training samples (insufficient generalization), in contrast to that of M3. By estimating the variance over the input region of interest, the surface reconstruction was comparatively better than that of M4. Also, the relative weighting surfaces of $p_1$ and $p_2$ of M1 were found to follow closely the relative gradient of the sigmoidal surface, as desired.

Figure 3 Surface reconstruction after 10 training cycles using four different models and a single set of 500 noisy training samples. (a) top left: single CMAC (C = 2); (b) top right: single CMAC (C = 7); (c) bottom left: additive CMACs (C = 2, 7); (d) bottom right: WME CMACs (1:3 λ ratio, C = 2, 7).

¹Results were omitted for reasons of space.

4 Conclusion

The validity of the WME model is based on the assumption that conditional mean estimation is the desired learning objective. The proposed architecture makes use of variance models which estimate the error variance of their expert models, and assigns less weighting to those expert models with larger variance. The estimation performance is adapted using a given set of training data, and the set of variance models evolves to form the desired prior for choosing an appropriate set of expert models. It is intuitive that, in order for each expert to excel in a different input region, the model structures must be unique; otherwise the bias/variance dilemma will remain unresolved. Having unique internal expert structures also provides implicit model selection, so that one does not need to search for an optimal structure for a single model. It would appear that having a mixture of experts increases the total number of free parameters for optimization, and that the resulting network will not be parsimonious. However, the objective is not to reduce the number of physical parameters, as advocated in many proposed architectures, but rather to reduce the number of effective parameters, governed by the variance models, in such a way that the WME model can be more robust with respect to noise and model mismatch.
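
The weighting idea summarised in this conclusion can be illustrated with a minimal sketch. It assumes, purely for illustration, that each expert's weight is proportional to the inverse of its estimated error variance; the function name `wme_combine`, the `eps` guard and the example numbers are hypothetical and do not reproduce the model's actual update equations.

```python
def wme_combine(expert_outputs, expert_variances, eps=1e-12):
    """Combine expert outputs with weights proportional to 1 / variance.

    Assumption: inverse-variance weighting stands in for the WME weighting
    scheme; the exact rule used by the WME model is not reproduced here.
    """
    raw = [1.0 / (v + eps) for v in expert_variances]
    total = sum(raw)
    weights = [r / total for r in raw]  # normalised so the weights sum to one
    estimate = sum(w * y for w, y in zip(weights, expert_outputs))
    return estimate, weights


# Two hypothetical experts: the second has three times the error variance,
# so it receives one third of the first expert's weight.
estimate, weights = wme_combine([1.0, 3.0], [1.0, 3.0])
```

An expert whose estimated variance is large in some input region contributes little there, which is the sense in which the variance models govern the number of effective parameters.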
The form of a radial basis network is a linear combination of translates of a given radial basis function, $\phi(r)$. The radial basis method involves determining the values of the unknown parameters within the network given a set of inputs, $\{\mathbf{x}_k\}$, and their corresponding outputs, $\{f_k\}$. It is usual for some of the parameters of the network to be fixed. If the positions of the centres of the basis functions are known and constant, the radial basis problem reduces to a standard linear system of equations and many techniques are available for calculating the values of the unknown coefficients efficiently. However, if both the positions of the centres and the values of the coefficients are allowed to vary, the problem becomes considerably more difficult. A highly non-linear problem is produced and solved in an iterative manner. An initial guess for the best positions of the centres is made and the coefficients for this particular choice of centres are calculated as before. For each iteration, a small change to the position of the centres is made in order to improve the quality of the network and the values of the coefficients for these new centre positions are determined. The overall algorithm is computationally expensive and here we consider ways of improving the efficiency of the method by exploiting the local stability of the thin plate spline basis function. At each step of the iteration, only a small change is made to the positions of the centres and so we can reasonably expect that there is only a small change to the values of the corresponding coefficients. These small changes are estimated using local modifications.

1  Introduction 

We consider the thin plate spline basis function

$$\phi(r) = r^2 \log r.$$

This basis function has not been as popular as the Gaussian, $\phi(r) = \exp(-r^2/2\sigma)$,
mainly  due  to  its  unbounded  nature.  The  thinking  behind  this  is  that  it  is  desirable 
to  use  a  basis  function  that  has  near  compact  support  so  that  the  approximating 
function  behaves  in  a  local  manner. 

A suitable definition of local behaviour for an approximating function of the form given in equation (1) is as follows. The ith coefficient, $c_i$, of the approximant should be related to the values $\{f_k\}$ for values of k where $\|\mathbf{x}_k - \lambda_i\|$ is small. Thus, the value of the coefficient should be influenced only by the data ordinates whose abscissae values are close to the centre of the corresponding basis function. From this definition one deduces that it is advisable to use basis functions that decay, hence the desire for basis functions with near compact support. However, many authors have shown that this deduction is unfounded and it is in fact easier to produce the properties of local behaviour by using unbounded functions [4, 2, 3, 1].

1.1  The  Approximation  Problem 

Given a set of m data points, $(\mathbf{x}_k, f_k)$, for k = 1, 2, ..., m, it is possible to produce an approximation of the form

$$f(\mathbf{x}) = \sum_{i=1}^{n} c_i\,\phi(\|\mathbf{x} - \lambda_i\|), \qquad (1)$$

such that

$$f(\mathbf{x}_k) \approx f_k.$$

Figure 1 The positions of the data abscissae and the centres of the basis functions, and the surface approximant to the data.

It  is  usual  for  this  approximation  problem  to  be  solved  in  the  least-squares  sense. 
There  are  two  important  categories  for  such  an  approximation.  The  easiest  approach 
is to consider the positions of the centres to be fixed, in which case the approximation
problem is linear and is therefore reasonably efficient to solve. The alternative
is  to  allow  the  positions  of  the  centres  to  vary  or  indeed  for  the  number  of  centres 
to  change.  For  example,  once  a  preliminary  approximation  has  been  completed  it 
is  worth  examining  the  results  to  determine  the  quality  of  the  approximation.  It 
may  be  decided  that  one  (or  more)  regions  are  unsatisfactory  and  so  extra  basis 
functions  or  an  adjustment  of  the  centres  of  the  current  basis  functions  would  be 
suitable. 
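
The fixed-centre case can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes the thin plate spline $\phi(r) = r^2 \log r$ (with $\phi(0) = 0$), solves the least-squares problem through the normal equations with a small Gaussian elimination routine, and uses hypothetical names (`tps`, `fit_fixed_centres`) and synthetic data.

```python
import math


def tps(r):
    """Thin plate spline phi(r) = r^2 log r, taking phi(0) = 0."""
    return 0.0 if r == 0.0 else r * r * math.log(r)


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_fixed_centres(points, values, centres, phi=tps):
    """Least-squares coefficients c_i for fixed centres (normal equations)."""
    a = [[phi(math.dist(p, c)) for c in centres] for p in points]
    n = len(centres)
    ata = [[sum(row[i] * row[j] for row in a) for j in range(n)] for i in range(n)]
    atf = [sum(row[i] * f for row, f in zip(a, values)) for i in range(n)]
    return solve(ata, atf)


# Synthetic check: data generated from known coefficients should be recovered.
centres = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
true_c = [1.0, -2.0, 0.5]
points = [(-1 + 0.75 * i, -1 + 0.75 * j) for i in range(5) for j in range(5)]
values = [sum(c * tps(math.dist(p, ctr)) for c, ctr in zip(true_c, centres))
          for p in points]
fitted = fit_fixed_centres(points, values, centres)
```

The normal equations are used here only to keep the sketch short; for problems of the size discussed in this paper an orthogonal factorisation would usually be preferred for numerical stability.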

Under  such  circumstances  it  is  usual  to  recalculate  the  new  coefficients  for  the 
approximation  problem  with  respect  to  the  new  positions  of  the  centres.  Repeating 
this  process  too  often  can  be  computationally  expensive  and  it  is  more  appropriate 
to  modify  the  current  approximation  to  take  into  account  the  small  changes  that 
have  been  made  to  the  centres. 

2  Local  Stability  of  the  Thin  Plate  Spline 

This local behaviour of the thin plate spline is demonstrated by an example that consists of approximating m = 6,864 data points using n = 320 basis functions. The centres for these basis functions are produced using a clustering algorithm and the data are fitted in the least-squares sense. Formally, let $I$ be the set of indices for the basis functions, $\{1, 2, \ldots, n\}$. We wish to calculate the values of the coefficients $\{c_i\}$ that minimize the $\ell_2$-norm of the residual vector $\mathbf{e}$, which has the components

$$e_k = f_k - \sum_{i \in I} c_i\,\phi(\|\mathbf{x}_k - \lambda_i\|),$$

for k = 1, 2, ..., m. The data, centres and the resulting approximant are shown in Figure 1.
Figure 2 The differences in the coefficients between the two approximants as a function of distance. Standard and logarithmic axes are shown.

Let $j \in I$ be the index for one of the centres. We now perturb this centre by a small amount, $\delta$, such that we produce a new centre,

$$\lambda_j^* = \lambda_j + \delta.$$

All of the other centres are left undisturbed; that is, $\lambda_i^* = \lambda_i$ for $i \in I$, $i \neq j$. Again the data are fitted in the least-squares sense, but this time we use the new set of centres, $\{\lambda_i^*\}$. Thus we calculate the values of the coefficients $c_i^*$, for each $i \in I$, that minimize the $\ell_2$-norm of the residual vector $\mathbf{e}^*$, which has the components

$$e_k^* = f_k - \sum_{i \in I} c_i^*\,\phi(\|\mathbf{x}_k - \lambda_i^*\|),$$

for k = 1, 2, ..., m. The two approximants are visually indistinguishable and so we compare the coefficients of the respective fits.

Let $d_i$ be the distance between the centre of the ith basis function and the position of the perturbed centre, $d_i = \|\lambda_j^* - \lambda_i\|$. Figure 2 shows the differences between the respective coefficients, $\{(c_i - c_i^*)\}$, of the two fits against the distances $\{d_i\}$. Also shown is the logarithm of the absolute values of the differences between the respective coefficients against the same set of distances.

It  can  be  seen  that  the  differences  between  the  coefficients  decay  exponentially  as 
the  distance  between  the  corresponding  centre  and  the  perturbation  increases.  It  is 
in  this  sense  that  we  say  that  the  thin  plate  spline  behaves  locally. 

3  Exploiting  the  Local  Behaviour 

Since the effects of perturbing the position of a given centre are only noticeable for the coefficients which correspond to basis functions that are centred in the neighbourhood of the perturbed centre, it seems reasonable that we should be able to retrain the network with these new centre positions by using only the basis functions from this neighbourhood. The coefficients for the other basis functions can be left unchanged on the assumption that they will not be affected by a significant amount. By using only a small number of centres we can expect this local approximation to be significantly faster than the global approach which uses all of the centres.

We adopt an iterative refinement type approach. First we choose to use a given number, q, of basis functions whose centres are nearest to the position of the new perturbed centre. Let $I^* \subset I$ be the set of indices for these q basis functions. Then we calculate the residuals between the original function values and the current estimate for the approximating function. This current estimate is based on the best approximant using the original positions of the centres, but with the basis functions for the new positions of the centres,

$$r_k = f_k - \sum_{i \in I} c_i\,\phi(\|\mathbf{x}_k - \lambda_i^*\|).$$

We wish to approximate these residuals using the approximating form

$$r_k \approx \sum_{i \in I^*} \delta c_i^*\,\phi(\|\mathbf{x}_k - \lambda_i^*\|).$$

This approximation is again done in the least-squares sense. This produces a “correction surface” that we combine with the current estimate for the approximant in order to produce an updated estimate (or combined surface) for the required approximant. The new approximant has the coefficients

$$c_i^{(new)} = \begin{cases} c_i + \delta c_i^*, & i \in I^*, \\ c_i, & i \notin I^*. \end{cases}$$

The  experiment  is  repeated  for  various  values  of  q  and  the  results  are  discussed 
below. 
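
The bookkeeping of this local modification can be sketched as follows. This is an illustrative fragment only: it selects the q centres nearest to the perturbed centre and merges a set of correction coefficients into the existing ones. The least-squares fit that produces the corrections is not shown, and the function names and numbers are hypothetical.

```python
import math


def nearest_centres(centres, perturbed_centre, q):
    """Indices of the q centres closest to the perturbed centre (the local set)."""
    order = sorted(range(len(centres)),
                   key=lambda i: math.dist(centres[i], perturbed_centre))
    return sorted(order[:q])


def merge_coefficients(coeffs, local_indices, corrections):
    """Add each correction to its local coefficient; leave the others unchanged."""
    updated = list(coeffs)
    for i, dc in zip(local_indices, corrections):
        updated[i] += dc
    return updated


# Hypothetical example: four centres, a perturbation near the origin, q = 2.
centres = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (0.2, 0.1)]
local = nearest_centres(centres, (0.1, 0.0), q=2)
new_coeffs = merge_coefficients([1.0, 2.0, 3.0, 4.0], local, [0.5, -1.0])
```

Only the q local coefficients change, so the cost of each refinement step depends on q rather than on the total number of basis functions.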

3.1  Choice  of  the  Data  Subset 

If  we  use  all  of  the  data  and  therefore  all  of  the  residuals,  we  find  that  the  resulting 
correction  surface  is  flat  in  nature  and  the  local  variation  around  the  position  of 
the  perturbation  has  been  largely  ignored.  The  reason  for  this  phenomenon  is  that 
we  are  treating  each  of  the  data  points  with  equal  importance  and  since  there  are 
many  more  data  outside  the  neighbourhood  of  interest  than  the  number  inside,  the 
former  group  of  data  tend  to  dominate  the  approximation. 

Since  we  are  only  using  the  basis  functions  that  are  local  to  the  position  of  the 
perturbation, it seems reasonable to use only the data that lie in the neighbourhood
of interest. Unfortunately this too has a problem in that we are effectively
assigning  no  importance  to  the  “far-away”  data  points  and  so  the  approximation 
has  a  tendency  to  behave  in  an  uncontrolled  manner  away  from  the  neighbourhood, 
in  much  the  same  way  that  the  basis  functions  are  uncontrolled. 

As  a  compromise  we  use  a  subset  of  data  that  consists  of  all  of  the  data  points  that 
are  close  to  the  perturbation  and  a  few  data  points  that  are  sampled  randomly  from 
the  remainder  of  the  data  domain.  In  the  example  that  we  are  considering  512  data 
points  were  used  for  the  local  subset  and  512  data  points  were  used  to  represent 
the  remainder  of  the  domain.  By  doing  this  we  wish  to  give  more  importance  to  the 
local data while still considering the global effect of the correction approximant.
An  advantage  that  accrues  from  this  approach  is  that  the  local  approximation 
uses  fewer  data  points  than  the  global  method  and  so  we  can  expect  the  speed  of 
calculating  the  approximation  to  be  significantly  faster. 
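
The compromise described above, all nearby data plus a random sample of the remainder, can be sketched as follows. The function name, the seeded generator and the one-dimensional toy data are assumptions made for illustration; the paper does not specify how its random sample was drawn.

```python
import math
import random


def choose_data_subset(points, perturbed_centre, n_local, n_far, seed=0):
    """Return (local, far): indices of the n_local nearest data points and a
    random sample of n_far indices drawn from the remaining points."""
    order = sorted(range(len(points)),
                   key=lambda k: math.dist(points[k], perturbed_centre))
    local = order[:n_local]
    rest = order[n_local:]
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    far = rng.sample(rest, min(n_far, len(rest)))
    return local, far


# Toy data on a line; the paper's example used 512 local and 512 remote points.
points = [(float(i),) for i in range(20)]
local, far = choose_data_subset(points, (0.0,), n_local=5, n_far=3)
```

Keeping a few remote points constrains the correction surface away from the neighbourhood without letting the remote data dominate the fit.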

4  Concluding  Remarks 

Various  values  for  q  were  chosen  and  the  local  modifications  were  calculated.  The 
value  of  the  square  root  of  the  mean  of  the  squares  of  the  errors  (R.M.S.)  between 
all  of  the  original  ordinates  and  the  values  of  the  combined  surface  at  the  data 
points  produced  using  the  global  approach  was  7.50  x  10“^  which  took  about  2,870 
seconds  to  calculate,  and  the  R.M.S.  value  for  the  current  approximation  with  the 
new centres was 15.73 x 10“^. By applying the local modification using q = 7, an
R.M.S.  value  of  7.52  x  10”^  was  obtained  in  less  than  three  tenths  of  a  second. 
Clearly  it  can  be  seen  that  an  excellent  improvement  in  the  R.M.S.  value  can 
be  achieved  using  only  a  very  small  number  of  basis  functions.  The  required  fit 
has  been  achieved  to  within  an  accuracy  of  one  per  cent  in  approximately  one 
ten  thousandth  of  the  time  needed  by  the  global  approach.  Visually,  there  is  no 
difference  between  the  approximations  produced  by  a)  the  global  approach  and  b) 
the  local  modification  method. 
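
As a quick check of the two figures quoted above (a sketch only, assuming that both R.M.S. values carry the same power of ten and taking the local timing at its stated upper bound of 0.3 seconds):

```latex
\frac{7.52 - 7.50}{7.50} \approx 0.0027 \quad (\text{about } 0.27\%,\ \text{within one per cent}),
\qquad
\frac{0.3\ \text{s}}{2870\ \text{s}} \approx 1.05 \times 10^{-4} \approx \frac{1}{10\,000}.
```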

Future  work  in  this  area  would  concentrate  on  the  following  areas. 

■ Convergence. Since each iteration is so much faster than the global approach we can afford to use the technique many times, using different sets of basis functions. However, it would be necessary to confirm that such an approach would continually refine and improve the quality of the approximation.

■ No attempt has been made to optimize the size or the positions of the subset of data used for the local modification. It is envisaged that a small but significant improvement in the quality of the approach could be made using such a technique.

■ Similarly the size and position of the subset of centres could be optimized. Particular attention could be directed towards the effects of larger perturbations and/or several small perturbations.
This paper analyses the statistical convergence properties of the modified NLMS rules which
were  formulated  in  an  attempt  to  produce  more  robust  and  faster  converging  training  algorithms. 
However,  the  statistical  analysis  described  in  this  paper  leads  us  to  the  conjecture  that  the  standard 
NLMS  rule  is  the  only  unconditionally  stable  modified  NLMS  training  algorithm,  and  that  the 
optimal  value  of  the  learning  rate  and  region  of  convergence  for  the  modified  NLMS  rules  is 
generally  less  than  for  the  standard  NLMS  rule. 

1  Adaptive  Systems  and  the  NLMS  Algorithm 

Nonlinear networks, such as the Cerebellar Model Articulation Controller (CMAC), Radial Basis Functions (RBF) and B-splines, have an “output layer” of linear parameters that can be directly trained using any of the linear learning algorithms that have been developed over the past 40 years. Consider a network that is linear in its parameter vector, of the form:

$$y(t) = \sum_{i=1}^{n} x_i(t)\,w_i(t-1) = \mathbf{x}^T(t)\,\mathbf{w}(t-1) \qquad (1)$$

where $y(t)$ is the system's output, $\mathbf{w}(t-1) = (w_1(t-1), \ldots, w_n(t-1))$ is the n-dimensional weight vector and $\mathbf{x}(t) = (x_1(t), \ldots, x_n(t))$ is the n-dimensional transformed input vector at time t. This “input” vector $\mathbf{x}$ could possibly be a nonlinear transformation of the network's original input measurements. For a linear system described by equation 1, the Normalised Least Mean Squares (NLMS) learning algorithm for a single training sample $\{\mathbf{x}(t), \hat{y}(t)\}$ is given by:

$$\Delta\mathbf{w}(t) = \mu\,\frac{e_y(t)}{\|\mathbf{x}(t)\|_2^2}\,\mathbf{x}(t) \qquad (2)$$

where $\Delta\mathbf{w}(t) = \mathbf{w}(t) - \mathbf{w}(t-1)$ is the weight vector update, $e_y(t) = (\hat{y}(t) - y(t))$ is the output error, $\hat{y}(t)$ is the desired network output at time t and $\mu \in [0,2]$ is the learning rate. This simple learning rule has a long history, as it was originally derived by Kaczmarz in 1937 [6] and was re-derived numerous times in the adaptive control [7] and neural network literature. When the training data are generated by an equivalent model with an unknown “optimal” parameter vector $\hat{\mathbf{w}}$ (i.e. there is no modelling error or measurement noise), the NLMS algorithm possesses the property that:

$$\|\mathbf{e}_w(t)\|_2 \le \|\mathbf{e}_w(t-1)\|_2 \qquad (3)$$

where $\mathbf{e}_w(t) = \hat{\mathbf{w}} - \mathbf{w}(t)$ is the weight error at time t. Hence the estimates of the weight vector approach (monotonically) the true values and this learning rule has many other desirable properties [3].
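
A minimal sketch of the standard NLMS update may help here. It implements the usual form of the rule, in which the weight change is the learning rate times the output error times the input vector, divided by the squared Euclidean norm of the input. The function names, the seeded random inputs and the example weight vector are illustrative assumptions.

```python
import random


def nlms_update(w, x, desired, mu=1.0):
    """One standard NLMS step: w + mu * error * x / ||x||^2."""
    y = sum(wi * xi for wi, xi in zip(w, x))
    error = desired - y
    norm_sq = sum(xi * xi for xi in x)
    if norm_sq == 0.0:  # a zero input carries no information about the weights
        return list(w)
    return [wi + mu * error * xi / norm_sq for wi, xi in zip(w, x)]


def weight_error_norm(w, w_opt):
    """Euclidean norm of the weight error w_opt - w."""
    return sum((a - b) ** 2 for a, b in zip(w_opt, w)) ** 0.5


# Noise-free data from a hypothetical optimal weight vector.
rng = random.Random(0)
w_opt = [0.5, -1.0, 2.0]
w = [0.0, 0.0, 0.0]
norms = [weight_error_norm(w, w_opt)]
for _ in range(50):
    x = [rng.gauss(0.0, 1.0) for _ in w_opt]
    desired = sum(a * b for a, b in zip(w_opt, x))
    w = nlms_update(w, x, desired, mu=0.5)
    norms.append(weight_error_norm(w, w_opt))
```

With noise-free data and a learning rate between 0 and 2, the recorded weight-error norms never increase, which is the monotonic property stated in equation 3.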
2 Modified NLMS algorithms

In 1971, a set of Modified NLMS (MNLMS) algorithms was proposed by Aved'yan [1, 2], and they were rediscovered over 20 years later by Douglas [5]. The MNLMS learning algorithms are derived from the two conditions:

Find $\mathbf{w}(t)$ such that:

$$\hat{y}(t) = \mathbf{x}^T(t)\,\mathbf{w}(t) \qquad (4)$$

and $\|\Delta\mathbf{w}(t)\|_p$ is minimised,

for different values of p where $1 \le p \le \infty$.

These instantaneous learning rules are based on generating a new search direction by minimising an alternative norm in weight space, and three special cases which are worthy of attention are the 1, 2 and ∞-norms. The 2-norm corresponds to the standard NLMS rule, which is denoted by std NLMS, but the $L_1$ learning algorithm is:

$$\Delta w_i(t) = \begin{cases} \mu\,e_y(t)/x_k(t) & \text{if } i = k \\ 0 & \text{otherwise} \end{cases} \qquad (5)$$

where $k = \arg\max_i |x_i(t)|$. This will be referred to as the max NLMS rule. The $L_\infty$ training rule is given by:

$$\Delta\mathbf{w}(t) = \mu\,\frac{e_y(t)}{\|\mathbf{x}(t)\|_1}\,\mathrm{sgn}(\mathbf{x}(t)) \qquad (6)$$

and this will be referred to as the sgn NLMS rule [7]. It is fairly simple to show that the a posteriori output error is always zero for these learning rules (with $\mu = 1$), so the only difference between them is how they search the weight space. The max NLMS algorithm always updates the weight vector parallel to an axis, and the sgn NLMS rule causes the weight vector update to be at 45° to an axis. This is in contrast to the std NLMS training procedure, which always projects the weight vector perpendicularly onto the solution hyperplane generated by the training data [6].
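
The three search directions can be compared with a short sketch. It assumes the following readings of the rules: the std rule moves along the input vector, the max rule changes only the weight attached to the largest-magnitude input, and the sgn rule moves along the sign vector of the input scaled by the 1-norm of the input. Function names and example values are hypothetical, and the input is assumed to be non-zero.

```python
def output_error(w, x, desired):
    """Desired output minus the current network output."""
    return desired - sum(wi * xi for wi, xi in zip(w, x))


def sign(v):
    return (v > 0) - (v < 0)


def std_update(w, x, desired, mu=1.0):
    e = output_error(w, x, desired)
    norm_sq = sum(xi * xi for xi in x)
    return [wi + mu * e * xi / norm_sq for wi, xi in zip(w, x)]


def max_update(w, x, desired, mu=1.0):
    e = output_error(w, x, desired)
    k = max(range(len(x)), key=lambda i: abs(x[i]))  # largest-magnitude input
    updated = list(w)
    updated[k] += mu * e / x[k]
    return updated


def sgn_update(w, x, desired, mu=1.0):
    e = output_error(w, x, desired)
    l1 = sum(abs(xi) for xi in x)
    return [wi + mu * e * sign(xi) / l1 for wi, xi in zip(w, x)]


# Hypothetical single training sample.
w0 = [0.0, 0.0, 0.0]
x = [1.0, -2.0, 0.5]
desired = 3.0
results = {rule.__name__: rule(w0, x, desired)
           for rule in (std_update, max_update, sgn_update)}
```

With a unit learning rate, each rule reproduces the desired output exactly on the sample it has just seen; the rules differ only in where on the solution hyperplane the new weight vector lands.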

3  A  Statistical  Analysis 

A  deterministic  analysis  of  the  MNLMS  rules  has  already  been  completed  and 
it  is  shown  in  [4]  that  certain  finite  training  sets  can  cause  unstable  learning  in 
the  linear  network  for  the  sgn  and  max  NLMS  rules,  irrespective  of  the  size  of 
the  (non-zero)  learning  rate.  The  statistical  analysis  of  the  MNLMS  rules  in  this 
paper is based on a model of the input vector where all components are mutually independent random processes with zero mean values, each of which is a sequence of independent, identically and symmetrically distributed random variables. This provides the conditions of convergence for the modified algorithms, the optimal value of the learning rate, and the influence of noise and of the mean value of the input process on the convergence conditions.

3.1  Convergence 

The process $\mathbf{w}(t)$ converges (in the statistical mean-square sense) to the optimal weight vector, $\hat{\mathbf{w}}$, if it satisfies the condition $\lim_{t \to \infty} E(\mathbf{e}_w^T(t)\,\mathbf{e}_w(t)) = 0$, where $E(\cdot)$ denotes the statistical expectation operator. The convergence depends greatly on the properties of the input process $\mathbf{x}(t)$; for instance, the standard NLMS algorithm does not converge to its optimal values when the process $\mathbf{x}(t)$ quickly settles to a constant value or when it varies very slowly, that is, when the input signal is not persistently exciting. This situation is even worse for both the max and sgn NLMS algorithms, as they search a restricted section of the parameter space and an optimal solution for the training data does not always lie in this region [3].

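It is possible to write the Euclidean squared norm of the error in the weight vector, $\mathbf{e}_w^T(t)\,\mathbf{e}_w(t)$, as:

$$\mathbf{e}_w^T(t)\,\mathbf{e}_w(t) = \mathbf{e}_w^T(t-1)\left[ I - \mu\,\frac{\mathbf{x}(t)\mathbf{s}^T(t)}{\mathbf{x}^T(t)\mathbf{s}(t)} - \mu\,\frac{\mathbf{s}(t)\mathbf{x}^T(t)}{\mathbf{x}^T(t)\mathbf{s}(t)} + \mu^2\,\frac{(\mathbf{s}^T(t)\mathbf{s}(t))(\mathbf{x}^T(t)\mathbf{x}(t))}{(\mathbf{x}^T(t)\mathbf{s}(t))^2}\,\frac{\mathbf{x}(t)\mathbf{x}^T(t)}{\mathbf{x}^T(t)\mathbf{x}(t)} \right]\mathbf{e}_w(t-1) \qquad (7)$$

where $\mathbf{s}(t)$ is the current search direction for the different learning rules, when the training data contain no measurement or modelling errors.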

Let us denote by $V_*(t) = E(\mathbf{e}_w^T(t)\,\mathbf{e}_w(t))$ the statistical expectation of the Euclidean squared norm of the error in the weight vectors in equation 7. Taking into account the statistical properties of the input vector $\mathbf{x}(t)$ and the fact that the trace of each matrix in these equations is equal to one, it is possible to generate first-order relations:

$$V_*(t) = (1 - 2n^{-1}\mu + \beta_*\mu^2)\,V_*(t-1) \qquad (8)$$

where a * is used instead of std, sgn and max, and:

$$\beta_{std} = n^{-1} \qquad (9)$$

$$\beta_{sgn} = E\left(\frac{\mathbf{x}^T(t)\mathbf{x}(t)}{(\mathbf{x}^T(t)\,\mathrm{sgn}(\mathbf{x}(t)))^2}\right) \qquad (10)$$

$$\beta_{max} = n^{-1}E\left(\frac{\mathbf{x}^T(t)\mathbf{x}(t)}{x_k^2(t)}\right) \qquad (11)$$

For these parameters we have the following inequalities:

$$n^{-1} \le \beta_{sgn} \le 1 \qquad (12)$$

$$n^{-1} \le \beta_{max} \le 1 \qquad (13)$$

The values of the parameters $\beta_{sgn}$ and $\beta_{max}$ depend on the probability distribution function of the vector $\mathbf{x}(t)$ and can be calculated analytically in the special cases shown below.

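From equation 8, it follows not only that the convergence conditions for the std, sgn and max NLMS algorithms are, respectively:

$$0 < q_* = (1 - 2n^{-1}\mu + \beta_*\mu^2) < 1 \qquad (14)$$

but also that the optimal value of the learning rate, $\hat{\mu}_*$, for which the values of $q_{std}$, $q_{sgn}$ and $q_{max}$ are minimal, is:

$$\hat{\mu}_* = (n\beta_*)^{-1} \qquad (15)$$

and consequently:

$$\hat{q}_* = (1 - \beta_*^{-1}n^{-2}) \qquad (16)$$

These values determine the convergence time of the respective algorithms, and:

$$\hat{\mu}_{sgn} < \hat{\mu}_{std} = 1 \qquad (17)$$

$$\hat{\mu}_{max} < \hat{\mu}_{std} = 1 \qquad (18)$$

$$\hat{q}_{sgn} > \hat{q}_{std} = (1 - n^{-1}) \qquad (19)$$

$$\hat{q}_{max} > \hat{q}_{std} \qquad (20)$$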
Hence it follows that the optimal value of the learning rate and the region of convergence for the sgn and max NLMS rules are less than for the std NLMS rule, and that the convergence time for the std NLMS rule is less than for the sgn and max NLMS rules. It is, however, possible to obtain analytical expressions for $\beta$ in equations 10 and 11 in certain special cases.

When $x_i(t)$ has a Laplace distribution, $p(x_i) \propto \exp(-|x_i|/a)$, then $\beta_{sgn} = 2/(n+1)$ and $\hat{\mu}_{sgn} = (n+1)/2n$ for the sgn NLMS rule. The corresponding region of convergence is $0 < \mu_{sgn} < (n+1)/n$, in contrast with the std NLMS rule where $0 < \mu_{std} < 2$. Hence, for large n, the convergence time of the sgn NLMS rule is approximately twice that of the std NLMS rule.

When $x_i(t)$ has a uniform distribution, $p(x_i) = 1/2a$, $|x_i| \le a$, then $\beta_{max} = (n+2)/3n$ and $\hat{\mu}_{max} = 3/(n+2)$. The corresponding region of convergence is $0 < \mu_{max} < 6/(n+2)$, and it follows that for large n, the time of convergence of the std NLMS rule is equal to $cn$, where c is a constant, whereas for the max NLMS rule this time is equal to $cn(n+2)/3$.

These examples show that the std NLMS rule has a larger convergence rate when compared with the sgn and the max NLMS algorithms for these specific input distributions. The MNLMS rules therefore represent a tradeoff between computational simplicity and stability/convergence rate.
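
The quantities in this subsection can be evaluated with a short sketch. It assumes the per-step contraction factor $q = 1 - 2\mu/n + \beta\mu^2$, the optimal learning rate $1/(n\beta)$ and the stability bound $2/(n\beta)$ obtained by requiring $q < 1$, together with the closed-form $\beta$ values quoted for the Laplace and uniform cases. The Monte Carlo estimator and all function names are illustrative additions.

```python
import random


def contraction_factor(mu, beta, n):
    """Per-step factor q multiplying the expected squared weight error."""
    return 1.0 - 2.0 * mu / n + beta * mu * mu


def optimal_learning_rate(beta, n):
    """Learning rate that minimises q."""
    return 1.0 / (n * beta)


def max_stable_learning_rate(beta, n):
    """Upper end of the convergence region, where q reaches one."""
    return 2.0 / (n * beta)


def estimate_beta_sgn_laplace(n, samples=20000, seed=0):
    """Monte Carlo estimate of E[x'x / (x'sgn(x))^2] for Laplace inputs.

    Only |x_i| enters the ratio, and |x_i| is exponentially distributed
    when x_i is Laplace, so exponential samples are sufficient.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        mags = [rng.expovariate(1.0) for _ in range(n)]
        total += sum(m * m for m in mags) / sum(mags) ** 2
    return total / samples


n = 5
beta_std = 1.0 / n
beta_sgn = 2.0 / (n + 1)        # Laplace inputs, sgn NLMS
beta_max = (n + 2) / (3.0 * n)  # uniform inputs, max NLMS
beta_sgn_mc = estimate_beta_sgn_laplace(n)
```

For n = 5 the optimal std learning rate is 1 with q = 0.8, while the sgn rule under Laplace inputs has an optimal rate of 0.6 and a larger q, so it contracts more slowly per step.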


3.2  The  Influence  of  Noise 

Assuming that the output data are corrupted with an additive, statistically independent, white noise sequence $\xi(t)$, with variance $\sigma_\xi^2$, and that the input vector has the same statistical properties as were mentioned above, then:

$$V_*(t) = (1 - 2n^{-1}\mu + \beta_*\mu^2)\,V_*(t-1) + \mu^2\sigma_\xi^2\,d_* \qquad (21)$$

where the disturbance terms are given by:

$$d_{std} = E\left((\mathbf{x}^T(t)\mathbf{x}(t))^{-1}\right) \qquad (22)$$

$$d_{sgn} = nE\left((\mathbf{x}^T(t)\,\mathrm{sgn}(\mathbf{x}(t)))^{-2}\right) \qquad (23)$$

$$d_{max} = E\left(x_k^{-2}(t)\right) \qquad (24)$$

Comparing equations 8 and 21, it follows that if the convergence conditions for the std, sgn and max NLMS algorithms (see inequality 14) remain satisfied, then after a transient period, the variance equals:

$$V_*(\infty) = \frac{\mu\,\sigma_\xi^2\,d_*}{2n^{-1} - \beta_*\mu} \qquad (25)\text{-}(27)$$

for each of the std, sgn and max rules; for the std NLMS rule, where $\beta_{std} = n^{-1}$, this reduces to $n\mu\,\sigma_\xi^2\,d_{std}/(2 - \mu)$.

After the initial “transient” convergence, the mean value of the process $\mathbf{w}(t)$ will be equal to $\hat{\mathbf{w}}$ and the weight vector will “jitter” around the mean value with a variance given in equations 25-27. This region of convergence was termed a minimal capture zone by Parks and Militzer [8]. For a specific input distribution, the size of the minimal capture zone can be made arbitrarily small by choosing a small learning rate, $\mu$, but this would also decrease the rate of convergence.
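
The trade-off between the size of the minimal capture zone and the rate of convergence can be made concrete with a sketch. It assumes the noisy first-order recursion $V(t) = q\,V(t-1) + \mu^2\sigma^2 d$ with $q = 1 - 2\mu/n + \beta\mu^2$, whose fixed point is $\mu\sigma^2 d/(2/n - \beta\mu)$. The function names and the numerical values of the noise variance and disturbance term are hypothetical.

```python
def steady_state_variance(mu, beta, n, noise_var, d):
    """Fixed point of V = q V + mu^2 * noise_var * d, with q = 1 - 2mu/n + beta mu^2."""
    denom = 2.0 / n - beta * mu
    if mu <= 0.0 or denom <= 0.0:
        raise ValueError("learning rate lies outside the convergence region")
    return mu * noise_var * d / denom


def iterate_variance(v0, mu, beta, n, noise_var, d, steps):
    """Run the first-order recursion for a given number of steps."""
    q = 1.0 - 2.0 * mu / n + beta * mu * mu
    v = v0
    for _ in range(steps):
        v = q * v + mu * mu * noise_var * d
    return v


# Hypothetical values: std rule with n = 5, so beta = 1/n.
n, beta, noise_var, d = 5, 0.2, 0.01, 1.0
v_fast = steady_state_variance(1.0, beta, n, noise_var, d)
v_slow = steady_state_variance(0.1, beta, n, noise_var, d)
v_iter = iterate_variance(1.0, 1.0, beta, n, noise_var, d, steps=200)
```

Reducing the learning rate from 1.0 to 0.1 shrinks the residual variance, but the contraction factor q moves closer to one, so the transient takes longer to die away.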
3.3 The Influence of Mean Value of the Input Process

The presence of a non-zero mean value in the input vector decreases the rate of convergence for all these algorithms. For the std NLMS algorithm, the presence of a non-zero mean value, $m_x$, in the input vector increases the time of convergence by a factor of approximately $(1 + m_x^2/\sigma_x^2)$ in comparison with the case when the mean value is zero. Hence, it is desirable both to centralise the input vector and to transform it such that all components of the vector have the same variance.

3.4  Simulation  and  Discussion 

The simulation, shown in Figure 1, illustrates the dynamical evolution of the Euclidean norm of the error in the weight vector for these algorithms. In this figure, the learning rates are set to their near-optimal values derived above. The adaptive system has 5 inputs, the optimal weight vector is $\hat{\mathbf{w}} = (0.22, 0.32, 0.42, 0.53, 0.63)$ and the random, Gaussian input vector has the statistical properties described above. These results support the theoretical conclusions.


Figure 1 The evolution of the magnitude of the error in the weight vector when it is trained using the std NLMS rule (solid line), sgn NLMS rule (dashed line) and the max NLMS rule (dotted line), with $\mu_{std} = 1$ and $\mu_{sgn} = \mu_{max} = 0.5$.

The MNLMS rules were originally proposed as a computationally efficient method for increasing the rate of convergence of the NLMS algorithm, as they search weight space in orthogonal directions. However, this has serious implications for the stability and rate of convergence of these instantaneous algorithms, as has been pointed out in this paper. For any network with 3 or more coefficients, the MNLMS procedures could potentially be unstable and their persistently exciting conditions are more severe than those of the standard NLMS rule.
We extend the recent progress in thermodynamic limit analyses of mean on-line gradient descent learning dynamics in multi-layer networks by calculating the fluctuations possessed by finite dimensional systems. Fluctuations from the mean dynamics are largest at the onset of specialisation as student hidden unit weight vectors begin to imitate specific teacher vectors, and increase with the degree of symmetry of the initial conditions. Including a term to stimulate asymmetry in the learning process typically significantly decreases finite size effects and training time.


Recent  advances  in  the  theory  of  on-line  learning  have  yielded  insights  into  the 
training  dynamics  of  multi-layer  neural  networks.  In  on-line  learning,  the  weights 
parametrizing  the  student  network  are  updated  according  to  the  error  on  a  single 
example from a stream of examples, generated by a teacher network. The analysis of the resulting weight dynamics has previously been treated by assuming an infinite input dimension (thermodynamic limit) such that a mean dynamics analysis is exact [2]. We present a more realistic treatment by calculating corrections to the mean dynamics induced by finite dimensional inputs [3].

We assume that the teacher network the student attempts to learn is a soft committee machine[1] of N inputs and M hidden units. This is a one hidden layer network in which the weights connecting each hidden unit to the output unit are set to +1, and each hidden unit n is connected to all input units by a weight vector $B_n$ (n = 1..M). Explicitly, for the N dimensional training input vector $\xi^\mu$, the output of the teacher is given by

$$\zeta^\mu = \sum_{n=1}^{M} g(B_n \cdot \xi^\mu) \qquad (1)$$

where g(x) is the activation function of the hidden units, and we take $g(x) = \mathrm{erf}(x/\sqrt{2})$. The teacher generates a stream of training examples $(\xi^\mu, \zeta^\mu)$, with input components drawn from a normal distribution of zero mean and unit variance. The student network that attempts to learn the teacher, by fitting the training examples, is also a soft committee machine, but with K hidden units. For input $\xi^\mu$ the student output is

$$\sigma(J, \xi^\mu) = \sum_{i=1}^{K} g(J_i \cdot \xi^\mu) \qquad (2)$$

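where the student weights $J = \{J_i\}$ (i = 1..K) are sequentially modified to reduce the error that the student makes on an input $\xi^\mu$,

$$\epsilon(J, \xi^\mu) = \frac{1}{2}\left(\sigma(J, \xi^\mu) - \zeta^\mu\right)^2 = \frac{1}{2}\left(\sum_{i=1}^{K} g(x_i^\mu) - \sum_{n=1}^{M} g(y_n^\mu)\right)^2 \qquad (3)$$

with the activations defined $x_i^\mu = J_i \cdot \xi^\mu$ and $y_n^\mu = B_n \cdot \xi^\mu$. Gradient descent on the error (3) results in an update of the student weight vectors,

$$J_i^{\mu+1} = J_i^\mu + \frac{\eta}{N}\,\delta_i^\mu\,\xi^\mu \qquad (4)$$

where

$$\delta_i^\mu = g'(x_i^\mu)\left[\sum_{n=1}^{M} g(y_n^\mu) - \sum_{j=1}^{K} g(x_j^\mu)\right] \qquad (5)$$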

and  g'  is  the  derivative  of  the  activation  function  g.  The  typical  performance  of  the 
student  on  a  randomly  selected  input  example  is  given  by  the  generalisation  error, 

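$$\epsilon_g = \langle \epsilon(J, \xi) \rangle \qquad (6)$$

where $\langle \cdot \rangle$ represents an average over the Gaussian input distribution. One finds that $\epsilon_g$ depends only on the overlap parameters $R_{in} = J_i \cdot B_n$, $Q_{ij} = J_i \cdot J_j$ and $T_{nm} = B_n \cdot B_m$ (i,j = 1..K; n,m = 1..M)[2], for which, using (4), we derive (stochastic) update equations,

$$R_{in}^{\mu+1} - R_{in}^{\mu} = \frac{\eta}{N}\,\delta_i^\mu\, y_n^\mu \qquad (7)$$

$$Q_{ij}^{\mu+1} - Q_{ij}^{\mu} = \frac{\eta}{N}\left(\delta_i^\mu x_j^\mu + \delta_j^\mu x_i^\mu\right) + \frac{\eta^2}{N^2}\,\delta_i^\mu \delta_j^\mu\, \xi^\mu \cdot \xi^\mu \qquad (8)$$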
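The on-line learning rule described above can be simulated directly for small N. The following is a minimal sketch, assuming the erf activation, unit-length teacher vectors and the update $J_i \leftarrow J_i + (\eta/N)\,\delta_i\,\xi$; the function names, the initial student scale and the Monte Carlo estimate of the generalisation error are illustrative choices, not the authors' code.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def g(x):
    """Hidden-unit activation g(x) = erf(x / sqrt(2))."""
    return _erf(np.asarray(x, dtype=float) / math.sqrt(2.0))

def g_prime(x):
    """Derivative of g: sqrt(2/pi) * exp(-x^2 / 2)."""
    x = np.asarray(x, dtype=float)
    return math.sqrt(2.0 / math.pi) * np.exp(-0.5 * x ** 2)

def online_step(J, B, xi, eta):
    """One on-line gradient step of the student weights J on a single input xi."""
    N = xi.shape[0]
    x = J @ xi                      # student activations, shape (K,)
    y = B @ xi                      # teacher activations, shape (M,)
    delta = g_prime(x) * (g(y).sum() - g(x).sum())
    return J + (eta / N) * np.outer(delta, xi)

def overlaps(J, B):
    """Overlap parameters R_in = J_i . B_n and Q_ij = J_i . J_j."""
    return J @ B.T, J @ J.T

def estimate_eg(J, B, rng, n_samples=4000):
    """Monte Carlo estimate of the generalisation error over Gaussian inputs."""
    xi = rng.standard_normal((n_samples, J.shape[1]))
    diff = g(xi @ J.T).sum(axis=1) - g(xi @ B.T).sum(axis=1)
    return 0.5 * float(np.mean(diff ** 2))

def run(N=20, K=2, M=1, eta=0.3, alpha_max=50, seed=0):
    """Train for alpha_max * N examples; return (initial eg, final eg, overlaps)."""
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((M, N))
    B /= np.linalg.norm(B, axis=1, keepdims=True)          # assumed T_nn = 1
    J = 0.3 * rng.standard_normal((K, N)) / math.sqrt(N)   # assumed small start
    eg_start = estimate_eg(J, B, rng)
    for _ in range(alpha_max * N):
        J = online_step(J, B, rng.standard_normal(N), eta)
    return eg_start, estimate_eg(J, B, rng), overlaps(J, B)
```

Repeating `run` over many seeds and comparing the spread of the overlaps with their mean gives an empirical view of the finite size fluctuations discussed in this paper.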


We average over the input distribution to obtain deterministic equations for the mean values of the overlap parameters, which are self-averaging in the thermodynamic limit. In this limit we treat $\mu/N = \alpha$ as a continuous variable and form differential equations for the thermodynamic overlaps,

$$\frac{dR_{in}}{d\alpha} = \eta \langle \delta_i y_n \rangle \qquad (9)$$

$$\frac{dQ_{ij}}{d\alpha} = \eta \langle \delta_i x_j + \delta_j x_i \rangle + \eta^2 \langle \delta_i \delta_j \rangle \qquad (10)$$

For given initial overlap conditions, (9,10) are integrated to find the mean dynamical behaviour of a student learning a teacher with an arbitrary number of hidden units[2] (see fig.(1a)). Typically, $\epsilon_g$ decays rapidly to a symmetric phase in which there is near perfect symmetry between the hidden units. Such phases exist in learnable scenarios until sufficient examples have been presented to determine which student hidden unit will mimic which teacher hidden unit. For perfectly symmetric initial conditions, such specialisation is impossible in a mean dynamics analysis. The more symmetric the initial conditions are, the longer the trapping in the symmetric phase (see fig.(2a)). Large deviations from the mean dynamics can exist in this symmetric phase, as a small perturbation from symmetry can determine which student hidden unit will specialise on which teacher hidden unit[1].

We rewrite (7,8) in the general form

$$a^{\mu+1} - a^{\mu} = \frac{\eta}{N}\left(F_a + \eta G_a\right) \qquad (11)$$

where $F_a + \eta G_a$ is the update rule for a general overlap parameter a. In order to investigate finite size effects, we make the following ansaetze for the deviations of



the update rules $F_a$ (the same form is made for $G_a$) and overlap parameters a from their thermodynamic values¹,

$$F_a = F_a^0 + \Delta F_a + \frac{\eta}{N} F_a^1, \qquad a = a^0 + \sqrt{\frac{\eta}{N}}\,\Delta a + \frac{\eta}{N}\, a^1 \qquad (12)$$

where $\langle \Delta F_a \rangle = \langle \Delta a \rangle = 0$. The update rule ansatz is motivated by observing that the activations have variance O(1) which, iterated through (11), yield overlap variances of $O(N^{-1})$. Terms of the form $\Delta a$ represent dynamic corrections that arise due to the random examples, and $a^1$ represent static corrections such that the mean of the overlap parameter a is given by $a^0 + \eta a^1/N$, the thermodynamic average plus a correction. In order to simplify the analysis, we assume a small learning rate, η, so that the thermodynamic overlaps are governed by,


$$\frac{da^0}{d\tilde{\alpha}} = F_a^0 \qquad (13)$$

where $F_a^0$ is the update rule $F_a$ averaged over the input distribution, and the rescaled time is given by

$$\tilde{\alpha} = \eta\alpha \qquad (14)$$

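Substituting (12) in (11) and averaging over the input distribution, we derive a set of coupled differential equations² for the (scaled) covariances $\langle \Delta a \Delta b \rangle$, and static corrections $a^1$,

$$\frac{d\langle \Delta a \Delta b \rangle}{d\tilde{\alpha}} = \sum_c \langle \Delta a \Delta c \rangle \frac{\partial F_b^0}{\partial c^0} + \sum_c \langle \Delta b \Delta c \rangle \frac{\partial F_a^0}{\partial c^0} + \langle \Delta F_a \Delta F_b \rangle \qquad (15)$$

$$\frac{da^1}{d\tilde{\alpha}} = \sum_b b^1 \frac{\partial F_a^0}{\partial b^0} + \frac{1}{2} \sum_{bc} \langle \Delta b \Delta c \rangle \frac{\partial^2 F_a^0}{\partial b^0 \partial c^0} - \frac{1}{2} \frac{d^2 a^0}{d\tilde{\alpha}^2} \qquad (16)$$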


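Summations are over all overlap parameters, $\{Q_{ij}, R_{in} \mid i,j = 1..K,\ n = 1..M\}$. The elements $\langle \Delta F_a \Delta F_b \rangle$ are found explicitly by calculating the covariance of the update rules $F_a$ and $F_b$. Initially, the fluctuations $\langle \Delta F_a \Delta F_b \rangle$ are set to zero, and equations (13,15) are then integrated to find the evolution of the covariances, $\mathrm{cov}(a,b) = (\eta/N)\langle \Delta a \Delta b \rangle$, and the corrections to the thermodynamic average values, $(\eta/N)a^1$. The average finite size correction to the generalisation error is given by

$$\epsilon_g = \epsilon_g^0 + \frac{\eta}{N}\,\epsilon_g^1 \qquad (17)$$

where

$$\epsilon_g^1 = \sum_a a^1 \frac{\partial \epsilon_g^0}{\partial a^0} + \frac{1}{2} \sum_{ab} \langle \Delta a \Delta b \rangle \frac{\partial^2 \epsilon_g^0}{\partial a^0 \partial b^0} \qquad (18)$$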
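The correction to the generalisation error is a second-order expansion of the mean of $\epsilon_g$ around the thermodynamic overlaps. A minimal sketch follows, assuming the form $\epsilon_g = \epsilon_g^0 + (\eta/N)\,\epsilon_g^1$ with $\epsilon_g^1$ built from the static corrections, the scaled covariances, and the gradient and Hessian of $\epsilon_g^0$. All function and argument names are hypothetical, and the inputs are taken as given rather than computed from the dynamics.

```python
import numpy as np

def eg_first_order(a1, cov, grad, hess):
    """eps_g^1 = sum_a a1_a * grad_a + 0.5 * sum_ab cov_ab * hess_ab."""
    a1 = np.asarray(a1, dtype=float)
    cov = np.asarray(cov, dtype=float)
    grad = np.asarray(grad, dtype=float)
    hess = np.asarray(hess, dtype=float)
    return float(a1 @ grad + 0.5 * np.sum(cov * hess))

def eg_finite_size(eg0, eg1, eta, N):
    """Generalisation error including the O(1/N) correction."""
    return eg0 + (eta / N) * eg1

def min_dimension(eg0, eg1, eta, tolerance=0.1):
    """Input dimension N at which |(eta/N) * eg1| equals tolerance * eg0."""
    return eta * abs(eg1) / (tolerance * eg0)
```

As an arithmetic illustration only: a ratio $|\epsilon_g^1| / \epsilon_g^0 = 2.5$ with a 10% tolerance gives a requirement of the form N > 25η.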


These results enable the calculation of finite size effects for an arbitrary learning scenario. For demonstration, we calculate the finite size effects for a student with two hidden units learning a teacher with one hidden unit. In this over-realisable case, one of the student hidden units eventually specialises on the single teacher hidden unit, while the other student hidden unit decays to zero. In fig.(1), we plot the thermodynamic limit generalisation error alongside the $O(N^{-1})$ correction. In fig.(1a) there is no significant symmetric phase, and the finite size corrections (fig.(1b))


¹If the order parameter represented by c is $Q_{11}$, then $c^0 = Q_{11}^0$, and $\Delta c = \Delta Q_{11}$.

²The small-fluctuations ansatz necessarily yields equations of the same form as presented in [4] for the weight component dynamics. Here they are for the order parameters of the system.

Figure 1 Two student hidden units, one teacher hidden unit. Non zero initial parameters: $Q_{11} = 0.2$, $Q_{22} = R_{11} = 0.1$. (a) Thermodynamic generalisation error, $\epsilon_g^0$. (b) $O(N^{-1})$ correction to the generalisation error, $\epsilon_g^1$. Simulation results for N = 10, η = 0.1 and (half standard deviation) error bars are drawn.

Figure 2 Two student hidden units, one teacher hidden unit. Initially, $Q_{11} = 0.1$, with all other parameters set to zero. (a) Thermodynamic generalisation error, $\epsilon_g^0$. (b) $O(N^{-1})$ correction to the generalisation error, $\epsilon_g^1$.

are small. For a finite size correction of less than 10%, we would require an input dimension of around N > 25η. For the more symmetric initial conditions (fig.(2a)) there is a very definite symmetric phase, for which a finite size correction of less than 10% (fig.(2b)) would require an input dimension of around N > 50,000η. As the initial conditions approach perfect symmetry, the finite size effects diverge, and the mean dynamical theory becomes inexact. Using the covariances, we can analyse


Figure 3 (a) The normalised components of the principal eigenvector for the isotropic teacher. M = K = 2, ($Q_{22} = Q_{11}$, $R_{22} = R_{11}$). Non zero initial parameters $Q_{11} = 0.2$, $Q_{22} = 0.1$, $R_{11} = 0.001$, $R_{22} = 0.001$.

Figure 4 Two student hidden units, one teacher hidden unit. The initial conditions are as in fig.(2). (a) Thermodynamic generalisation error, $\epsilon_g^0$. (b) $O(N^{-1})$ correction to the generalisation error, $\epsilon_g^1$.


the way in which the student breaks out of the symmetric phase by specialising its hidden units. For the isotropic teacher scenario $T_{nm} = \delta_{nm}$, and M = K = 2, learning proceeds such that one can approximate $Q_{22} = Q_{11}$, $R_{22} = R_{11}$. By analysing the eigenvalues of the covariance matrix $\langle \Delta a \Delta b \rangle$, we found that there is a sharply defined principal direction, the components of which we show in fig.(3). Initially, all components of the principal direction are similarly correlated, which corresponds to the symmetric region. Then, around α = 20, as the symmetry breaks, $R_{11}$ and $R_{21}$ become maximally anti-correlated, whilst there is minimal correlation between the $Q_{11}$ and $Q_{12}$ components. This corresponds well with predictions from perturbation analysis[2]. The symmetry breaking is characterised by a specialisation process in which each student vector increases its overlap with one particular teacher weight, whilst decreasing its overlap with other teacher weights. After the specialisation has occurred, there is a growth in the anti-correlation between the student length and its overlap with other students. The asymptotic values of these correlations are in agreement with the convergence fixed point, $R^2 = Q = 1$.

In light of possible prolonged symmetric phases, we break the symmetry of the student hidden units by imposing an ordering on the student lengths, $Q_{11} > Q_{22} > \cdots > Q_{KK}$, which is enforced in a ‘soft’ manner by including an extra term in (3),

$$\frac{1}{2} \sum_{j=1}^{K-1} h\left(Q_{j+1\,j+1} - Q_{jj}\right) \qquad (19)$$

where h(x) approximates the step function,

$$h(x) = \frac{1}{2}\left(1 + \mathrm{erf}\left(\frac{\beta}{\sqrt{2}}\,x\right)\right) \qquad (20)$$
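The soft ordering term can be evaluated directly from the student lengths. The sketch below assumes the penalty is half the sum of soft steps $h(Q_{j+1\,j+1} - Q_{jj})$ over neighbouring units, with $h(x) = \frac{1}{2}(1 + \mathrm{erf}(\beta x/\sqrt{2}))$ and steepness β. The function names are illustrative.

```python
import math

def soft_step(x, beta):
    """Smooth approximation of the step function; beta sets the steepness."""
    return 0.5 * (1.0 + math.erf(beta * x / math.sqrt(2.0)))

def ordering_penalty(lengths, beta):
    """Penalty that is small when lengths[0] > lengths[1] > ... holds.

    `lengths` stands for the diagonal overlaps Q_11, Q_22, ..., Q_KK.
    """
    return 0.5 * sum(
        soft_step(lengths[j + 1] - lengths[j], beta)
        for j in range(len(lengths) - 1)
    )
```

With a large β the penalty is close to zero for correctly ordered lengths and close to (K - 1)/2 when every neighbouring pair is in the wrong order, so its gradient pushes the student away from the symmetric configuration in which all lengths are equal.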

This straightforward modification involves the addition of a Gaussian term in the student weight lengths to the weight update rule (4). In fig.(4), we show the overlap parameters and their fluctuations for β = 10, K = 2, M = 1. This graph is to be compared to fig.(2) for which the initial conditions are the same. There is now no collapse to an initial symmetric phase from which the student will eventually specialise. Also, the initial convergence to the optimal values is much faster. As there is no symmetric phase, the finite size corrections are much reduced and are largest around the initial value of α where the overlap parameters are most symmetric, decreasing rapidly due to the driving force away from this near-symmetric region. For the case in which the teacher weights are equal, the constraint (19) prevents the student from converging optimally. A naive scheme to prevent this is to adapt the steepness, β, such that it is inversely proportional to the average of the gradients of $Q_{ii}$, which decreases as the dynamics converge asymptotically.

We  conjecture  that  such  symmetry  breaking  is  potentially  of  great  benefit  in  the 
practical  field  of  neural  network  training. 

REFERENCES 

[1] M. Biehl and H. Schwarze, Learning by online gradient descent, Journal of Physics A, Vol. 28 (1995), pp643-656.

[2] D. Saad and S. Solla, Exact solution for online learning in multilayer neural networks, Physical Review Letters, Vol. 74(21) (1995), pp4337-4340.

[3] P. Sollich, Finite size effects in learning and generalization in linear perceptrons, Journal of Physics A, Vol. 27 (1994), pp7771-7784.

[4] T. Heskes, Journal of Physics A, Vol. 27 (1994), pp5145-5160.

Acknowledgements 

This work was partially supported by the EU grant ERB CHRX-CT92-0063.

The paper presents a theoretical proof revealing an intrinsic limitation of digital VLSI technology: its inability to cope with highly connected structures (e.g. neural networks). We are in fact able to prove that efficient digital VLSI implementations of neural networks (known as VLSI-optimal when minimising the $AT^2$ complexity measure, A being the area of the chip and T the delay for propagating the inputs to the outputs) are achieved for small-constant fan-in gates. This result builds on quite recent ones dealing with a very close estimate of the area of neural networks when implemented by threshold gates, but it is also valid for classical Boolean gates. Limitations and open questions are presented in the conclusions.

Keywords: neural networks, VLSI, fan-in, Boolean circuits, threshold circuits, $F_{n,m}$ functions.

1  Introduction 

In this paper a network will be considered an acyclic graph having several input nodes (inputs) and some (at least one) output nodes (outputs). The nodes are characterised by fan-in (the number of incoming edges, denoted by Δ) and fan-out (the number of outgoing edges), while the network has a certain size (the number of nodes) and depth (the number of edges on the longest input to output path). If with each edge a synaptic weight is associated and each node computes the weighted sum of its inputs to which a non-linear activation function is then applied (artificial neuron), the network is a neural network (NN):

$$Z_k = (z_0, \ldots, z_{n-1}) \in \mathbb{R}^n,\ k = 1, \ldots, m, \quad \text{and} \quad f(Z_k) = \sigma\left(\sum_{i=0}^{n-1} w_i z_i + \theta\right) \qquad (1)$$

with $w_i \in \mathbb{R}$ the synaptic weights, $\theta \in \mathbb{R}$ known as the threshold, and σ a non-linear activation function. If the non-linear activation function is the threshold (logistic) function, the neurons are threshold gates (TGs) and the network is just a threshold gate circuit (TGC) computing a Boolean function (BF). The cost functions associated to a NN are depth and size. These are linked to T ∝ depth and A ∝ size of a VLSI chip. Unfortunately, NNs do not closely follow these proportionalities as:

■  the  area  of  the  connections  counts  [2,  3, 9]; 

■  the  area  of  one  neuron  is  related  to  its  associated  weights. 

That  is  why  the  size  and  depth  complexity  measures  are  not  the  best  criteria  for 
ranking  different  solutions  when  going  to  silicon  [11].  Several  authors  have  taken  into 
account  the  fan-in  [1,  9,  10,  12],  the  total  number  of  connections,  the  total  number 
of  bits  needed  to  represent  the  weights  [8,  15]  or  even  more  precise  approximations 
like  the  sum  of  all  the  weights  and  thresholds  [2-7]: 

$$\text{area} \propto \sum_{\text{all neurons}} \left(\sum_{i=0}^{n-1} |w_i| + |\theta|\right) \qquad (2)$$
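The weight-based area estimate is simple to evaluate for a concrete circuit. The sketch below assumes the estimate is the sum, over all neurons, of the absolute values of the weights plus the absolute value of the threshold, up to a constant of proportionality. The representation of a neuron as a `(weights, threshold)` pair is an illustrative choice.

```python
def neuron_area(weights, threshold):
    """Area estimate of one neuron: sum of |w_i| plus |threshold|."""
    return sum(abs(w) for w in weights) + abs(threshold)

def network_area(neurons):
    """Area estimate of a network given as a list of (weights, threshold) pairs."""
    return sum(neuron_area(weights, threshold) for weights, threshold in neurons)

# Two threshold gates on two binary inputs, using the convention
# output = 1 when sum(w_i * x_i) + threshold >= 0 (an assumed convention):
and_gate = ((1, 1), -2)
or_gate = ((1, 1), -1)
example_area = network_area([and_gate, or_gate])
```

Under this measure a gate with few inputs but large weights can cost more than a gate with many inputs and unit weights, which is why size alone is a poor proxy for silicon area.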

An equivalent definition of ‘complexity’ for a NN is $\sum_{i=0}^{n-1} w_i^2$. It is worth mentioning that there are also several sharp limitations for VLSI implementations like: (i) the maximal value of the fan-in cannot grow over a certain limit; (ii) the




maximal ratio between the largest and the smallest weight. For simplification, in the following we shall consider only NNs having n binary inputs and k binary outputs. If real inputs and outputs are needed, it is always possible to quantize them up to a certain number of bits so as to achieve a desired precision. The fan-in of a gate will be denoted by Δ and all the logarithms are taken to base 2 unless mentioned otherwise. Section 2 will present previous results for which proofs have already been given [2-7]. In section 3 we shall prove our main claim while also showing several simulation results.

2  Background 

A  novel  synthesis  algorithm  evolving  from  the  decomposition  of  COMPARISON 
has  recently  been  proposed.  We  have  been  able  to  prove  that  [2,  3]: 

Proposition 1 The computation of COMPARISON of two n-bit numbers can be realised by a Δ-ary tree of size $O(n/\Delta)$ and depth $O(\log n / \log \Delta)$ for any integer fan-in $2 \leq \Delta \leq n$.

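A class of Boolean functions $F_\Delta$ having the property that every $f_\Delta \in F_\Delta$ is linearly separable has afterwards been introduced as: ‘the class of functions $f_\Delta$ of Δ input variables, with Δ even, $f_\Delta = f_\Delta(g_{\Delta/2-1}, e_{\Delta/2-1}, \ldots, g_0, e_0)$, and computing $f_\Delta = \bigvee_{j=0}^{\Delta/2-1} \left[ g_j \wedge \left( \bigwedge_{k=j+1}^{\Delta/2-1} e_k \right) \right]$’. By convention, we consider $\bigwedge_{i=\Delta/2}^{\Delta/2-1} e_i = 1$. One restriction is that the input variables are pair-dependent, meaning that we can group the Δ input variables in Δ/2 pairs of two input variables each: $(g_{\Delta/2-1}, e_{\Delta/2-1}), \ldots, (g_0, e_0)$, and that in each such group one variable is ‘dominant’ (i.e. when a dominant variable is 1, the other variable forming the pair will also be 1):

$$F_\Delta = \left\{ f_\Delta \;\middle|\; f_\Delta : \{(0,0), (0,1), (1,1)\}^{\Delta/2} \to \{0,1\},\ \Delta/2 \in \mathbb{N}^*,\ f_\Delta = \bigvee_{j=0}^{\Delta/2-1} \left[ g_j \wedge \left( \bigwedge_{k=j+1}^{\Delta/2-1} e_k \right) \right] \right\}$$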

Each $f_\Delta$ can be built starting from the previous one $f_{\Delta-2}$ (having a lower fan-in) by copying its synaptic weights; the constructive proof has led to [5]:

Proposition 2 The COMPARISON of two n-bit numbers can be computed by a Δ-ary tree neural network with polynomially bounded integer weights and thresholds ($\leq n^k$) having size $O(n/\Delta)$ and depth $O(\log n / \log \Delta)$ for any integer fan-in $3 \leq \Delta \leq \log n$.


For a closer estimate of the area we have used equation (2) and proved [5]:

Proposition 3 The neural network with polynomially bounded integer weights (and thresholds) computing the COMPARISON of two n-bit numbers occupies an area of $O(n \cdot 2^{\Delta/2}/\Delta)$ for all the values of the fan-in (Δ) in the range 3 to $O(\log n)$.

The result presented there is:

$$AT^2(n, \Delta) = \frac{2^{\Delta/2}}{\Delta} \cdot \frac{8n\Delta - 6n - 5\Delta}{\Delta - 2} \cdot \frac{\log^2 n}{\log^2 \Delta} = O\left(\frac{n \log^2 n \cdot 2^{\Delta/2}}{\Delta \log^2 \Delta}\right) \qquad (3)$$

and for Δ = log n this is the best (i.e. smallest) one reported in the literature.
Further, the synthesis of a class of Boolean functions $F_{n,m}$ (functions of n input variables having m groups of ones in their truth table [13]) has been detailed [4]:

Proposition 4 Any function $f \in F_{n,m}$ can be computed by a neural network with polynomially bounded integer weights (and thresholds) of size $O(mn/\Delta)$ and depth $O(\log(mn)/\log \Delta)$ and occupying an area of $O(mn \cdot 2^{\Delta}/\Delta)$ if $2m \leq 2^{\Delta}$ for all the values of the fan-in (Δ) in the range 3 to $O(\log n)$.

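More precisely we have:

$$A(n, m, \Delta) < 2m \left( \frac{4n \cdot 2^{\Delta}}{\Delta} + \frac{5(n - \Delta) \cdot 2^{\Delta/2}}{\Delta(\Delta - 2)} \right) + \ldots = O\left(\frac{mn \cdot 2^{\Delta}}{\Delta}\right)$$

which leads to:

$$AT^2(n, m, \Delta) = O\left(\frac{mn \cdot \log^2(mn) \cdot 2^{\Delta}}{\Delta \log^2 \Delta}\right) \qquad (4)$$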

For $2m > 2^{\Delta}$ the equations are much more intricate, while the complexity values for area and for $AT^2$ are only reduced by a factor (equal to the fan-in [6, 7]). If we now suppose that a feed-forward NN of n inputs and k outputs is described by m examples, it can be directly constructed as simultaneously implementing k different functions from $F_{n,m}$ [4, 6, 7]:

Proposition 5 Any set of k functions $f \in F_{n,m}$, i = 1, 2, ..., m, $i \leq m \leq 2^{\Delta-1}$ can be computed by a neural network with polynomially bounded integer weights (and thresholds) having size $O(m(2n + k)/\Delta)$, depth $O(\log(mn)/\log \Delta)$ and occupying an area of $O(mn \cdot 2^{\Delta}/\Delta + mk)$ if $2m \leq 2^{\Delta}$, for all the values of the fan-in (Δ) in the range 3 to $O(\log n)$.

The  architecture  has  a  first  layer  of  COMPARISONS  which  can  either  be  imple¬ 
mented  using  classical  Boolean  gates  (BGs)  or  —  as  it  has  been  shown  previously 
—  by  TGs.  The  desired  function  can  be  synthesised  either  by  one  more  layer  of 
TGs,  or  by  a  classical  two  layers  AND-OR  structure  (a  second  hidden  layer  of 
AND  gates  —  one  for  each  hypercube),  and  a  third  layer  of  k  OR  gates  represents 
the  outputs.  For  minimising  the  area  some  COMPARISONS  could  be  replaced  by 
AND  gates  (like  in  a  classical  disjunctive  normal  form  implementation). 

3  Which  is  the  VLSI-Optimal  Fan-In? 

Not wanting to complicate the proofs, we shall determine the VLSI-optimal fan-in when implementing COMPARISON (in fact: $F_{n,1}$ functions) for which the solution was detailed in Propositions 1 to 3. The same result is valid for $F_{n,m}$ functions as can be intuitively expected either by comparing equations (3) and (4), or because:

■  the  delay  is  determined  by  the  first  layer  of  COMPARISONS;  while 

■  the  area  is  determined  by  the  same  first  layer  of  COMPARISONS  (the  ad¬ 
ditional  area  for  implementing  the  symmetric  ‘alternate  addition’  [4]  can  be 
neglected). 

For  a  better  understanding  we  have  plotted  equation  (3)  in  Figure  1. 

Proposition 6 The VLSI-optimal (which minimises the $AT^2$) neural network which computes the COMPARISON of two n-bit numbers has small-constant fan-in ‘neurons’ with small-constant bounded weights and thresholds.

Proof: Starting from the first part of equation (3) we can compute its derivative:

$$\frac{d(AT^2)}{d\Delta} = \frac{\ln 2}{2} \cdot \frac{2^{\Delta/2} \log^2 n}{\Delta^2 (\Delta-2)^2 \log^3 \Delta} \times \Big( 8n\Delta^3 \log\Delta - 22n\Delta^2 \log\Delta + 12n\Delta \log\Delta - 5\Delta^3 \log\Delta + 10\Delta^2 \log\Delta - \frac{16}{\ln 2}\, n\Delta^2 \log\Delta + \frac{24}{\ln 2}\, n\Delta \log\Delta - \frac{24}{\ln 2}\, n \log\Delta + \frac{10}{\ln 2}\, \Delta^2 \log\Delta - \frac{32}{\ln^2 2}\, n\Delta^2 + \frac{88}{\ln^2 2}\, n\Delta - \frac{48}{\ln^2 2}\, n + \frac{20}{\ln^2 2}\, \Delta^2 - \frac{40}{\ln^2 2}\, \Delta \Big)$$

Figure 1 The $AT^2$ values of COMPARISON, plotted as a 3D surface, versus the number of inputs n and the fan-in Δ for: (a) many inputs n ≤ 1024 (4 ≤ Δ ≤ 20); and (b) few inputs n ≤ 64 (4 ≤ Δ ≤ 20). It can be very clearly seen that a ‘valley’ is formed and that the ‘deepest’ points constantly lie somewhere between $\Delta_{minim} = 5$ and $\Delta_{maxim} = 10$.

which, unfortunately, involves transcendental functions of the variables in an essentially non-algebraic way. If we consider the simplified ‘complexity’ version of equation (3) we have:

$$\frac{d(AT^2)}{d\Delta} \simeq \frac{d}{d\Delta}\left(\frac{n \log^2 n \cdot 2^{\Delta/2}}{\Delta \log^2 \Delta}\right) = \frac{n \log^2 n \cdot 2^{\Delta/2}}{\Delta \log^2 \Delta} \left(\frac{\ln 2}{2} - \frac{1}{\Delta} - \frac{2}{\Delta \ln \Delta}\right)$$

which when equated to zero leads to $\ln\Delta\,(\Delta \ln 2 - 2) = 4$ (also a transcendental equation). This has Δ = 6 as ‘solution’ and as the weights and the thresholds are bounded by $2^{\Delta/2}$ (Proposition 3) the proof is concluded. □
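The stationarity condition in the proof, ln Δ (Δ ln 2 - 2) = 4, has no closed-form solution but is easy to check numerically. The sketch below solves it by bisection and also locates the integer fan-in that minimises the simplified complexity $n \log^2 n \cdot 2^{\Delta/2} / (\Delta \log^2 \Delta)$. The function names, the search interval [3, 20] and the choice n = 64 are illustrative assumptions.

```python
import math

def stationarity(delta):
    """Left-hand side minus right-hand side of ln(D) * (D * ln 2 - 2) = 4."""
    return math.log(delta) * (delta * math.log(2.0) - 2.0) - 4.0

def bisect_root(f, lo, hi, tol=1e-10):
    """Plain bisection; assumes f(lo) and f(hi) have opposite signs."""
    f_lo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def simplified_at2(delta, n):
    """Simplified AT^2 estimate with base-2 logarithms."""
    return n * math.log2(n) ** 2 * 2.0 ** (delta / 2.0) / (delta * math.log2(delta) ** 2)

root = bisect_root(stationarity, 3.0, 20.0)   # real-valued optimum, a little above 6
best_integer = min(range(3, 21), key=lambda d: simplified_at2(d, 64))
```

Because n enters the simplified estimate only as a multiplicative factor, the minimising fan-in does not depend on n, which is the sense in which the optimum is a small constant.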

The proof has been obtained using several successive approximations: neglecting the ceilings and using a ‘simplified’ complexity estimate. That is why we present in Figure 2 exact plots of the $AT^2$ measure which support our previous claim. It can be seen that the optimal fan-in ‘constantly’ lies between 6 and 9 (as $\Delta_{optim} = 6 \ldots 9$, one can minimise the area by using COMPARISONS only if the group of ones has a length of $a \geq 64$; see [4-7]). Some plots in Figure 2 also include a TG-optimal solution denoted by SRK [14] and the logarithmic fan-in solution (Δ = log n) denoted B_lg [5].


4  Conclusions 

This paper has presented a theoretical proof for one of the intrinsic limitations of digital VLSI technology: there are no ‘optimal’ solutions able to cope with highly connected structures. For doing that we have proven the contrary, namely that constant fan-in NNs are VLSI-optimal for digital architectures (either Boolean or using TGs). Open questions remain concerning ‘if’ and ‘how’ such a result could be extended to purely analog or mixed analog/digital VLSI circuits.

Figure 2 The $AT^2$ values of COMPARISON for different number of inputs n and fan-in Δ (B_Δ): (a) for 4 ≤ n ≤ 32 including the SRK [14] solution; (b) detail showing the optimum fan-in for the same interval (4 ≤ n ≤ 32); (c) for 32 ≤ n ≤ 256 including the SRK [14] solution; (d) detail showing the optimum fan-in for the same interval (32 ≤ n ≤ 256); (e) for 256 ≤ n ≤ 1024 including the SRK [14] solution; (f) detail showing the optimum fan-in for the same interval (256 ≤ n ≤ 1024).

THE APPLICATION OF BINARY ENCODED 2ND
DIFFERENTIAL SPECTROMETRY IN

This paper describes classification of UV-Vis optical absorption spectra by binary encoding segments of the second derivative of the absorption spectra according to their shape. This allows successful classification of spectra using the Back Propagation Neural Network analysis (BPNN) algorithm where other preprocessing schemes have failed. It is also shown that, once classified, estimation of chemical species concentration using a further stage of BPNN is possible. Data for the study are derived from laboratory-based measurements of UV-Vis optical absorption spectra from mixtures of common chemical pollutants.

1  Introduction 

This study has the goal of developing artificial intelligence methods to analyse UV-Vis spectra and hence determine actual chemical species and their concentrations in real-time in-line monitor systems. Prior to the study, a wide range of NN methods (BP, Radial Basis, Kohonen, etc.), topologies and iteration conditions were evaluated for their ability to classify and/or estimate components in the data. All these methods were unsuccessful and pointed to the need for a more knowledge-based approach. In the present work two approaches to preprocessing the data before classification by BPNN are evaluated. The first method, 2nd derivative spectrometry, relies only on the spectral data and if successful would give self-classifying, self-learning solutions. The second method, a modification of 2nd derivative spectrometry [1], depends on knowledge of the absorption spectra of expected constituent species.

2  Experimental 

Data  for  the  study  were  obtained  from  laboratory-based  UV-Vis  optical  absorption 
spectra  measurements  taken  from  mixtures  of  three  common  chemical  pollutants 
in  water  prepared  at  three  different  concentrations.  Stock  solutions  were  prepared 
from  addition  of  Sodium  Nitrate,  Ammonia  solution  and  Sodium  Hypochlorite  to 
distilled  water  and  these  were  mixed  in  all  possible  combinations  to  provide  a  set 
of  training  data  with  64  members.  64  samples  were  also  mixed  for  a  test  set  at 
concentrations  approximately  30%  greater  than  those  for  the  training  set.  Species 
and concentrations are summarised in Table 1. The apparatus used in the experiments consisted of a Hewlett-Packard 8452A diode array spectrometer equipped with a 1 cm quartz cell operated remotely using proprietary software. All spectroscopic data were transferred to computer via a serial interface for analysis using Microsoft Windows based software including the Neural Desk v2.1 Neural Network software package[2]. UV intensity spectra of the solution were recorded from 190 to 820 nm at 2 nm intervals. Absorption spectra were calculated from the intensity spectra using equation (1) with spectra from distilled water as a reference. For data analysis, data points 4 nm apart were used, giving data arrays of 43 points.

TRAINING SET
Level   NH3 (mg/l)   Cl2 (mg/l)   NO3 (mg/l)
A       105.83       49.23        7.75
B       32.55        24.62        3.88
C       11.18        6.15         1.94

TESTING SET
Level   NH3 (mg/l)   Cl2 (mg/l)   NO3 (mg/l)
D       137.7        60.48        9.75
E       45.9         30.24        4.88
F       15.3         5.04         1.95

Table 1 Chemical species and concentrations used to obtain the data sets.

$$A = \log\left(\frac{I_{blank} - \text{darkcurrent}}{I_{sample} - \text{darkcurrent}}\right) \qquad (1)$$
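The absorbance calculation is a one-line transformation of the measured intensities. The sketch below assumes a base-10 logarithm (the usual convention for absorbance, not stated explicitly here) and a single dark-current value subtracted from both the blank and the sample intensity. The function and argument names are illustrative.

```python
import math

def absorbance(i_blank, i_sample, dark_current=0.0):
    """Absorbance from blank and sample intensities after dark-current correction."""
    blank = i_blank - dark_current
    sample = i_sample - dark_current
    if blank <= 0 or sample <= 0:
        raise ValueError("corrected intensities must be positive")
    return math.log10(blank / sample)
```

For example, a blank reading of 1100 counts, a sample reading of 110 counts and a dark current of 100 counts give corrected intensities of 1000 and 10, that is an absorbance of 2. This also shows how little light remains when strongly absorbing species are mixed at high concentration.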

Figure  1  shows  absorption  spectra  obtained  for  the  three  individual  species  and  the 
spectra  when  these  are  mixed  together.  These  spectra  are  typical  of  those  obtained 
for  this  study  and  they  exemplify  two  important  features  of  UV-Vis  spectrometry. 
Firstly,  UV-Vis  spectral  peaks  from  water  are  in  general  broad  and  overlapping, 
typically  30nm  wide  -  this  makes  them  difficult  to  discern.  Secondly,  the  dynamic 
range in absorption for the contaminant species is high. Consequently when two or more species are mixed at high concentrations very little light is left for measurement, and the signal to noise level is reduced. The experimental data set is also subject to some systematic error due to base line shifts representative of drift in the experimental apparatus over long time periods (days) and also because of undetermined chemistry due to reactions between components in the mixture. However, these factors should not limit the ability to classify the spectra, only the ability to quantify, as each constituent in a sample should result in its own distinctive absorption peak, even if the relationship between peak height and concentration contains some error and/or is non-linear.


Figure 1 The absorbance spectra of NO3 7.75 mg/l, NH3 105.83 mg/l and Cl2 49.23 mg/l and a mixture of these (solid line).


3  Simulating  Data  Patterns  by  Adding  Systematic  Noise 

To obtain sufficient training and testing sample sets [3], a systematic noise characteristic of stray light in the optical system was added to the raw data in the range 0-5% in intervals of 0.5%, using formula (2), in a similar manner to the work of Gemperline et al [4]. This generated 704 training patterns and 704 test patterns.

$$A_{i\lambda}(\text{generate}) = A_{i\lambda} + \log(\ldots) \qquad (2)$$

where $A_{i\lambda}$ is the absorption of the ith component at the wavelength λ, and E is the fraction of stray light added.
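The pattern counts quoted in the text can be checked with a short enumeration. The sketch below assumes that "all possible combinations" means each of the three species is either absent or present at one of its three concentrations (four options per species), and that the stray-light levels run from 0% to 5% inclusive in steps of 0.5%. Both are interpretations, and all names are illustrative.

```python
from itertools import product

# Training-set concentrations in mg/l, taken from Table 1.
TRAINING_LEVELS = {
    "NH3": (105.83, 32.55, 11.18),
    "Cl2": (49.23, 24.62, 6.15),
    "NO3": (7.75, 3.88, 1.94),
}

def mixtures(levels):
    """All mixtures with each species absent (0.0) or at one of its levels."""
    species = list(levels)
    options = [(0.0,) + tuple(levels[name]) for name in species]
    return [dict(zip(species, combo)) for combo in product(*options)]

def stray_light_levels(start=0.0, stop=5.0, step=0.5):
    """Stray-light percentages from start to stop inclusive."""
    count = int(round((stop - start) / step)) + 1
    return [start + k * step for k in range(count)]

samples = mixtures(TRAINING_LEVELS)
patterns = [(mix, level) for mix in samples for level in stray_light_levels()]
```

Under these assumptions there are 4 × 4 × 4 = 64 mixtures and 11 stray-light levels, giving 64 × 11 = 704 patterns, in agreement with the numbers reported for both the training and the test set.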

4  Results 

A  summary  of  conditions  for  the  BPNN  classifiers  used  after  the  preprocessing  steps 
is  given  in  Table  2.  Figure  2  shows  a  diagram  of  the  root  mean  square  training  error 
against  iteration  epoch  and  Figure  3  is  a  diagram  of  the  percentage  error  in  the 
testing  set  after  intervals  of  200  epochs.  A  minimum  in  the  error  in  Figure  3  indicates 
the  best  compromise  between  under  training  and  over  training. 


Stochastic back-propagation: learning rate = 0.1, alpha = 0.9

                                                 Network Topology
Pre-processing                                   inputs   hidden   outputs
2nd derivative values                              21        5        3
Encode the shape of 2nd derivative spectra         22        5        3

Table 2 Summary of conditions for the BPNN classifiers used after the
preprocessing steps.


Figure 2 Training error as a function of epoch for preprocessing with PCA, 2nd
Derivative spectra, and Binary Encoded 2nd Derivative Spectra.

Figure 3 Testing error after every 200 training epochs as a function of epoch for
preprocessing with PCA, 2nd Derivative spectra, and Binary Encoded 2nd Derivative
Spectra.

5  2nd  Derivative  Preprocessing  with  BPNN  classification 

Following the procedure of Antonov and Stoyanov [5], the spectra are transformed
into differential absorption spectra, which increases the likelihood of discerning
spectral features related to the presence of a species. In this trial, a data set of
704 spectra, each consisting of 21 values of the second derivative with respect to
wavelength ($d^2A/d\lambda^2$), was prepared. As shown in Figure 2 and Figure 3, this
scheme is not successful at classifying the data.

6  Binary  Encoded  2nd  Derivative  Preprocessing 

The idea behind binary encoding the 2nd derivative spectra is to obtain a code
that represents the shape of the spectral data and is as independent as possible of
the intensity of the data. The procedure consists of binary encoding segments of
the second derivative of the absorption spectra according to the state of the slope,
i.e. of the third derivative. Only relevant data are used, that is, data from around
the absorption peaks of expected species. In detail, the binary encoding scheme
consists of: firstly, segmenting the 2nd derivative spectra between 190 and 350 nm
into 10 segments as shown in Figure 3, with the segment between 206 and 224 nm
divided into two segments to ensure that the minimum in this region is included;
secondly, encoding the slope in each segment with a 2 bit binary code of 01 for
decreasing, 10 for increasing, 11 for convex, and 00 for unchanged; and thirdly,
using the resulting 22 bit code as input to a classifying BPNN. The topology of this
network is 22 input nodes, 5 hidden nodes, and 3 output nodes. The three outputs
determine whether Nitrate, Chlorine, or Ammonia are present. As can be seen in
Figures 2 and 3, the classification was much more accurate than in the previous
attempt and the training error converged to a small value very quickly. This results
in a 93.75% prediction confidence overall for classification. The remaining error of
6.25% arises for the case of a mixture of Ammonia and Nitrate, which is mis-predicted
as a mixture of Ammonia, Nitrate and Chlorine. However, as will be shown in the next
section, when this classification is followed up by the third stage of estimation the
mis-identified chlorine is predicted as occurring at a low, insignificant level. Thus
overall the two stages tend to cancel out the error.
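
The encoding step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the segment boundaries in `HYPOTHETICAL_EDGES` are invented for illustration (the text only fixes the 190-350 nm range and the split of the 206-224 nm segment), the tolerance `tol` is an assumed parameter, and "convex" is interpreted here as any segment in which the slope changes sign.

```python
import numpy as np

# 2-bit codes quoted in the text.
CODES = {
    "decreasing": (0, 1),
    "increasing": (1, 0),
    "convex": (1, 1),
    "unchanged": (0, 0),
}

# Hypothetical boundaries (nm): 12 edges give 11 segments, i.e. a 22-bit code.
# Only 190, 206, 224 and 350 come from the text; the rest are placeholders.
HYPOTHETICAL_EDGES = [190, 206, 215, 224, 240, 256, 272, 288, 304, 320, 336, 350]


def segment_shape(values, tol=1e-6):
    """Label one segment of 2nd-derivative values by the behaviour of its slope."""
    slope = np.diff(np.asarray(values, dtype=float))
    if np.all(np.abs(slope) <= tol):
        return "unchanged"
    if np.all(slope >= -tol):
        return "increasing"
    if np.all(slope <= tol):
        return "decreasing"
    return "convex"  # assumption: the slope changes sign inside the segment


def encode_spectrum(wavelengths, d2a, edges=HYPOTHETICAL_EDGES, tol=1e-6):
    """Return the concatenated 2-bit codes of all segments as a 0/1 vector."""
    wavelengths = np.asarray(wavelengths, dtype=float)
    d2a = np.asarray(d2a, dtype=float)
    bits = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (wavelengths >= lo) & (wavelengths <= hi)
        bits.extend(CODES[segment_shape(d2a[mask], tol)])
    return np.array(bits, dtype=int)
```

Because only the sign pattern of the slope is kept, multiplying a spectrum by a positive constant leaves the code unchanged as long as the slopes stay clear of the tolerance, which is the intensity independence the scheme aims for.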

7  Estimation  of  Concentration 

Following  on  from  the  binary  encoded  second  derivative  preprocessing,  the  scheme 
for  estimation  of  concentration  also  uses  a  knowledge  of  components  expected  in  the 
spectra.  Here  one  of  seven  possible  networks  which  predict  concentration  is  chosen 
depending on the output of the classification network. The seven networks cover all
the possibilities: 1. Chlorine, 2. Nitrate, 3. Chlorine and Nitrate, 4. Ammonia,
5. Chlorine and Ammonia, 6. Nitrate and Ammonia and 7. Chlorine, Nitrate and
Ammonia. These are described in the next section. In this scheme absorption peak
shape data for the various components (Nitrate at 210 nm, Hypochlorite at 290
nm and Monochloramine at 245 nm) are used to determine the input variables for
the seven BPNNs used for estimating concentration. The inputs used are the 7
values of absorbance centred around the peak region of each species as depicted in
Table 3. Training and testing data for these BPNNs were generated using extinction
coefficient  data  for  each  of  the  expected  components  and  varying  the  concentration 
of  each  species  over  the  ranges  shown  in  Table  3.  The  extinction  coefficients  data  sets 
required  for  this  were  obtained  by  partial  least  squares  fitting  the  absorption  spectra 
using  the  combined  original  raw  training  and  testing  data  sets.  Data  from  this  new 
training  set  was  used  with  one  of  the  following  network  topologies,  depending  on  the 
output  of  the  classification  network.  The  various  networks  were:  3  hidden  nodes  for 
7 inputs for the cases with 1 output node, 4 hidden nodes for 14 inputs for the cases
with  2  outputs,  and  5  hidden  nodes  for  21  inputs  for  the  cases  with  3  outputs.  The 
algorithm  used  was  Stochastic  BP  with  0.1  learning  rate  and  0.9  momentum.  It  is 
worth  noting  that  these  estimation  networks  are  linear  since  they  were  trained  using 
data  from  a  linear  model  of  absorbance.  Testing  data  for  the  estimating  networks 
were  generated  by  selecting  the  values  of  absorption  half  way  between  the  values  of 
absorption  used  for  the  training  set  and  adding  5%  stray  light;  i.e.  E  =  0.05  in  (2). 
The training and testing error for the Ammonia and Chlorine network was typical
of the result obtained for all the estimation networks. For this network the error
converged very quickly, from 0.1889 at the 1st epoch to less than 0.0025 after
only 100 epochs. The testing error also reduced to a minimum of 0.015 after 500
training epochs. The overall accuracy of the estimating networks is shown in Table
4, which tabulates the number of training patterns and the resulting error for
the networks described in the text.
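
The hand-over from the classification network to one of the seven estimation networks amounts to a lookup on the three presence flags. The sketch below is illustrative only: the function name `select_network`, the keyword arguments and the 0.5 decision threshold are assumptions, while the network numbering follows the list given in the text.

```python
# Keys are (chlorine, nitrate, ammonia) presence flags; values are the
# network numbers 1-7 as listed in the text.
ESTIMATION_NETWORK = {
    (1, 0, 0): 1,  # Chlorine
    (0, 1, 0): 2,  # Nitrate
    (1, 1, 0): 3,  # Chlorine and Nitrate
    (0, 0, 1): 4,  # Ammonia
    (1, 0, 1): 5,  # Chlorine and Ammonia
    (0, 1, 1): 6,  # Nitrate and Ammonia
    (1, 1, 1): 7,  # Chlorine, Nitrate and Ammonia
}


def select_network(chlorine, nitrate, ammonia, threshold=0.5):
    """Map the three classifier outputs to an estimation network number.

    The threshold is an assumed value; None is returned when no species
    is flagged as present.
    """
    flags = tuple(int(v >= threshold) for v in (chlorine, nitrate, ammonia))
    return ESTIMATION_NETWORK.get(flags)
```

For example, classifier outputs of 0.9 for Chlorine, 0.1 for Nitrate and 0.8 for Ammonia would select network 5 under this assumed threshold.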


             Concentration Varying (mg/l)
Net   Chlorine      Nitrate       Ammonia       INPUTS Absorption units at
1.    4.5 - 40.0    -             -             279-303 nm
2.    -             1.2 - 8.0     -             203-227 nm
3.    4.0 - 40.0    (illegible)   -             279-303 & 203-227 nm
4.    -             -             4.5 - 40.0    191-219 nm
5.    4.0 - 40.0    -             4.0 - 38.0    191-219 & 231-255 & 279-303 nm
6.    -             1.2 - 8.0     4.0 - 40.0    191-227 nm
7.    5.0 - 40.0    1.0 - 8.0     5.0 - 40.0    191-227 & 231-255 & 279-303 nm

Table 3 Concentration range of each species used in generating training patterns.
1. Chlorine (7 inputs) 2. Nitrate (7 inputs) 3. Chlorine and Nitrate (14 inputs) 4.
Ammonia (7 inputs) 5. Chlorine and Ammonia (21 inputs) 6. Nitrate and Ammonia
(9 inputs) 7. Chlorine, Nitrate and Ammonia (23 inputs).


ALGORITHM: Stochastic BP with 0.1 learning rate and 0.9 momentum

Network                              patterns   Result Error
1. Chlorine                            356      0.03%
2. Nitrate                             137      0.05%
3. Chlorine and Nitrate                350      0.18% & 0.20%
4. Ammonia                             356
5. Chlorine and Ammonia                350      0.72% & 0.17%
6. Nitrate and Ammonia                 350      0.68% & 0.56%
7. Chlorine, Nitrate and Ammonia       512      0.13% & 1.72% & 1.03%

Table 4 Network's Topology and Performance.


8  Conclusions 

In laboratory-based trials, 2nd-derivative analysis is found to be an ineffective
preprocessing method in BPNN classification of UV-Vis absorption from water
samples. This follows on from earlier work in which various NN algorithms including
BPNN were evaluated for classification and estimation and also found ineffective.
Subsequently a more knowledge-based approach has been formulated which restricts
itself to determining the presence and concentration of a range of expected species.
The scheme involves a three stage process. In the first stage shape information
is derived by binary encoding segments of the second derivative of the absorption
spectra according to their shape. The rationale of this stage is to reduce the
spectral information to shape sensitive factors. This is found to ease significantly
the classification of the spectra by a second stage of BPNN analysis. For estimation
of the concentration of species, absorption data for the expected species are used to
train a second BPNN stage. Segmentation of the spectra and selection of relevant
inputs for this BPNN stage are determined from the absorption data for the
expected species, to give the best segmentation pattern and minimum number of
network inputs. The two-step approach taken to classification and then estimation
is better than a one-step approach. The first-step network specifies which species
are likely to occur and the second-step network can then focus on a few inputs that
strongly correlate with the presence of the expected species. Also, the second step
provides a filter that compensates for classification of species at low concentration
levels or incorrect identification of species due to low level signals with noise.

Figure 4 Absorption spectra and 1st and 2nd derivative spectra for
monochloramine and monochloramine plus nitrate.

REFERENCES 

[1] Sommer, L., Analytical absorption spectrophotometry in the visible and ultraviolet: the
principles, (1989) Elsevier.

[2] Neural Desk: User's Guide, Neural Computer Sciences, (1992).

[3] Hammerstrom, D. M., Working with neural networks, IEEE Spectrum, July (1993), pp. 46-53.

[4] Gemperline, P. J., Long, J. R. and Gregoriou, V. J., Nonlinear Multivariate Calibration Using
Principal Components Regression and Artificial Neural Networks, Anal. Chem., Vol. 63 (1991),
pp. 2313-2323.

[5] Antonov, L. and Stoyanov, S., Analysis of the Overlapping Bands in UV-Vis Absorption
spectroscopy, Applied Spectroscopy, Vol. 47 (1993), no. 7, pp. 1030-1035.


A  NON-EQUIDISTANT  ELASTIC  NET  ALGORITHM 

Jan van den Berg and Jock H. Geselschap

Department  of  Computer  Science, 
Erasmus  University  Rotterdam,  The  Netherlands. 

Email:  jvandenberg@few.eur.nl 

The statistical mechanical derivation by Simic of the Elastic Net Algorithm (ENA) from a
stochastic Hopfield neural network is criticized. In our view, the ENA should be considered a
dynamic penalty method. Using a linear distance measure, a Non-equidistant Elastic Net Algorithm
(NENA) is presented. Finally, a Hybrid Elastic Net Algorithm (HENA) is discussed.

1  Stochastic  Hopfield  and  Elastic  Neural  Networks 

Hopfield introduced the idea of an energy function into neural network theory [5].
Like Simic [8], we use Hopfield's energy expression multiplied by -1, i.e.

$$E(S) = \frac{1}{2} \sum_{ij} w_{ij} S_i S_j + \sum_i I_i S_i, \qquad (1)$$

where $S \in \{0,1\}^n$ and all $w_{ij} > 0$. Making the units stochastic, the network
can be analyzed applying statistical mechanics. We concentrate on the free energy [6, 6]

$$F = \langle E(S) \rangle - TS, \qquad (2)$$

where $T = 1/\beta$ is the temperature, where $\langle E(S) \rangle$ represents the
average energy, and where $S$ equals the so-called entropy. A minimum of $F$ corresponds
to a thermal equilibrium state [6]. We shall apply the next theorem [10, 11]:


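Theorem 1 In mean field approximation, the free energy $F_c$ of constrained stochastic
binary Hopfield networks, submitted to the constraint $\sum_i S_i = 1$, equals

$$F_c(V) = -\frac{1}{2} \sum_{ij} w_{ij} V_i V_j - \frac{1}{\beta} \ln\Big[ \sum_i \exp\Big(-\beta\Big(\sum_j w_{ij} V_j + I_i\Big)\Big) \Big]. \qquad (3)$$

The stationary points of $F_c$ are found at state space points where

$$V_i = P(S_i = 1 \wedge \forall j \neq i : S_j = 0) = \frac{\exp\big(-\beta(\sum_j w_{ij} V_j + I_i)\big)}{\sum_l \exp\big(-\beta(\sum_j w_{lj} V_j + I_l)\big)}. \qquad (4)$$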
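
The stationarity condition (4) has the form of a fixed-point equation, so one simple way to explore it numerically is to iterate the right-hand side. The following Python sketch does this for a small network; the function name `mean_field_step` and the use of plain fixed-point iteration are assumptions made for illustration, not a procedure taken from the paper.

```python
import numpy as np


def mean_field_step(V, W, I, beta):
    """Apply the right-hand side of (4) once.

    V : current mean activations, shape (n,)
    W : weight matrix w_ij, shape (n, n)
    I : external inputs I_i, shape (n,)
    """
    z = -beta * (W @ V + I)
    z = z - z.max()        # shift for numerical stability; ratios are unchanged
    e = np.exp(z)
    return e / e.sum()     # normalisation enforces sum_i V_i = 1
```

Whatever the value of `beta`, the returned vector sums to one, which reflects the remark made later in the text that strongly imposed constraints are fulfilled on average at every temperature.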


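Let $S^i_p$ denote whether the salesman at time $i$ occupies space-point $p$ or not
($S^i_p = 1$ or $0$). Then the corresponding Hamiltonian may be stated as [8]

$$E_{tsp}(S) = \frac{1}{4} \sum_i \sum_{pq} d^2_{pq} S^i_p (S^{i+1}_q + S^{i-1}_q) + \frac{\alpha}{4} \sum_i \sum_{pq} d^2_{pq} S^i_p S^i_q. \qquad (5)$$

The first term represents the sum of distance-squares, the second term penalizes
the simultaneous presence of the salesman at more than one position. The other
constraints, which should guarantee that every city is visited once, can be built-in
‘strongly’ using $\forall p : \sum_i S^i_p = 1$. Eventually, one finds [8, 12] the free energy

$$F_{tsp}(V) = -\frac{1}{4} \sum_i \sum_{pq} d^2_{pq} V^i_p (V^{i+1}_q + V^{i-1}_q) - \frac{\alpha}{4} \sum_i \sum_{pq} d^2_{pq} V^i_p V^i_q - \frac{1}{\beta} \sum_p \ln\Big[ \sum_i \exp\Big(-\frac{\beta}{2} \sum_q d^2_{pq} (\alpha V^i_q + V^{i+1}_q + V^{i-1}_q)\Big) \Big]. \qquad (6)$$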
On the other hand, the ‘elastic net’ algorithm [2] has the energy function

$$E_{en}(x) = \frac{\alpha_2}{2} \sum_i |x^{i+1} - x^i|^2 - \frac{\alpha_1}{\beta} \sum_p \ln \sum_j \exp\Big(-\frac{\beta^2}{2} |x_p - x^j|^2\Big). \qquad (7)$$

Here, $x^i$ represents the $i$-th elastic net point and $x_p$ represents the location of
city $p$. Application of gradient descent on (7) yields the updating rule:

$$\Delta x^i = \frac{\alpha_2}{\beta} (x^{i+1} - 2x^i + x^{i-1}) + \alpha_1 \sum_p \Lambda^p(i) (x_p - x^i), \qquad (8)$$

where $\Lambda^p(i) = \exp(-\frac{\beta^2}{2} |x_p - x^i|^2) / \sum_l \exp(-\frac{\beta^2}{2} |x_p - x^l|^2)$ and where the
time-step $\Delta t = 1/\beta$ equals the current temperature $T$.
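
As an illustration of updating rule (8), the sketch below performs one synchronous update of all net points on a closed ring using NumPy. The function name `ena_step`, the array layout and the default values of `alpha1` and `alpha2` are assumptions made for this example; the paper does not prescribe an implementation.

```python
import numpy as np


def ena_step(x, cities, beta, alpha1=1.0, alpha2=1.0):
    """One update of all elastic net points following rule (8).

    x      : net points on a closed ring, shape (N, 2)
    cities : city locations x_p, shape (P, 2)
    beta   : inverse temperature, so the time-step is 1 / beta
    """
    # Elastic ring term: x^{i+1} - 2 x^i + x^{i-1} (indices wrap around).
    ring = np.roll(x, -1, axis=0) - 2.0 * x + np.roll(x, 1, axis=0)

    # Differences x_p - x^i for every city/net-point pair, shape (P, N, 2).
    diff = cities[:, None, :] - x[None, :, :]
    logits = -0.5 * beta**2 * np.sum(diff**2, axis=2)
    logits = logits - logits.max(axis=1, keepdims=True)  # stabilise exp
    lam = np.exp(logits)
    lam = lam / lam.sum(axis=1, keepdims=True)  # Lambda^p(i), normalised over i

    # Mapping term: sum over cities p of Lambda^p(i) * (x_p - x^i).
    mapping = np.einsum("pi,pid->id", lam, diff)

    return x + (alpha2 / beta) * ring + alpha1 * mapping
```

In an annealing loop one would call `ena_step` repeatedly while increasing `beta` step by step, which shrinks the elastic contribution by the factor $1/\beta$ and sharpens the weights $\Lambda^p(i)$ around each city.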

2  Why  the  ENA  is  a  Dynamic  Penalty  Method 

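Three objections against Simic's derivation of (7) (with $\alpha_1 = \alpha_2 = 1$) from (6) are
given. To derive a free energy expression in the standard form (2), Simic applies a
Taylor series expansion on the last term of (6). We try to do the same. Taking

$$f(a) = \sum_p \ln\Big[ \sum_i \exp(a^i_p) \Big], \qquad (9)$$

$$a^i_p = -\frac{\alpha\beta}{2} \sum_q d^2_{pq} V^i_q, \qquad h^i_p = -\frac{\beta}{2} \sum_q d^2_{pq} (V^{i+1}_q + V^{i-1}_q), \qquad (10)$$

and using (4) (adapted to the TSP, with $\alpha > 1$), we obtain [12]

$$f(a + h) = \sum_p \ln\Big[ \sum_i \exp(a^i_p) \Big] + \sum_{ip} h^i_p \frac{\partial f}{\partial a^i_p}(a) + O(|h|^2) \qquad (11)$$

$$\approx \sum_p \ln \sum_i \exp\Big(-\frac{\alpha\beta}{2} \sum_q d^2_{pq} V^i_q\Big) - \frac{\beta}{2} \sum_i \sum_{pq} d^2_{pq} V^i_p (V^{i+1}_q + V^{i-1}_q).$$

Substitution of this result in (6) yields:

$$F_{app}(V) = \frac{1}{4} \sum_i \sum_{pq} d^2_{pq} V^i_p (V^{i+1}_q + V^{i-1}_q) - \frac{\alpha}{4} \sum_i \sum_{pq} d^2_{pq} V^i_p V^i_q - \frac{1}{\beta} \sum_p \ln \sum_i \exp\Big(-\frac{\alpha\beta}{2} \sum_q d^2_{pq} V^i_q\Big). \qquad (12)$$

Objection 1. Since $h^i_p$ is proportional to $\beta$, the Taylor-approximation (11) does
not hold for high values of $\beta$. This is a fundamental objection because during the
execution of the ENA, $\beta$ is increased step by step. □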


Next, Simic performs a ‘decomposition of the particle (salesman) trajectory’:

$$x^i = \langle x(i) \rangle = \sum_p x_p \langle S^i_p \rangle = \sum_p x_p V^i_p. \qquad (13)$$

$x(i)$ is the stochastic and $x^i$ the average position of the salesman at time $i$. Using
the decomposition, Simic writes $\sum_q d^2_{pq} V^i_q = |x_p - x^i|^2$. By this, he makes
a crucial transformation from a linear function in $V^i_q$ into a quadratic one in $x^i$.
Substitution of the result in (12) (with $\alpha = \beta$), neglect of the second term, and
application of the decomposition (13) on the first term of (12) finally yield (7).

Objection 2. Careful analysis [12] shows that in general

$$\sum_q d^2_{pq} V^i_q = \sum_q (x_p - x_q)^2 V^i_q \neq |x_p - x^i|^2. \qquad □$$
Energy (6) is a special case of a generalized free energy of type (3), whose
stationary points are solutions of equations of type (4).
Whatever the temperature, these stationary points are found at states where, on
average, all strongly submitted constraints are fulfilled. Moreover, stationary points
of a free energy of type (3) are often maxima [11, 12].

Objection 3. An analysis of the free energy of the ENA (section 3) yields a very
different view: both terms of (7) create a set of minima. A competition takes place
between feasibility and optimality, where the current temperature determines the
overall effect. This corresponds to the classical penalty method. A difference from
that approach is that here, like in the Hopfield-Lagrange model [9], the penalty
weights change dynamically. Consequently, we consider the ENA a dynamic penalty
method. □

The last observation corresponds to the theory of so-called deformable templates
[7, 13], where the corresponding Hamiltonian equals

$$E_{dt}(S, x) = \frac{\alpha_2}{2} \sum_i |x^{i+1} - x^i|^2 + \sum_{pj} S^j_p |x_p - x^j|^2. \qquad (14)$$

A statistical analysis [7, 13] of $E_{dt}$ yields the free energy (7). A comparison between
(7) and (14) clarifies that the first energy expression is derived from the second one
by adding noise exclusively to the penalty terms.

3  An  Analysis  of  the  ENA 

We can analyze the ENA by inspection of the energy landscape [3, 12]. The general
behavior of the algorithm leads to large-scale, global adjustments early on. Later on,
smaller, more local refinements occur. Equidistance is enforced by the first, ‘elastic
ring term’ of (7), which corresponds to parabolic pits in the energy landscape.
Feasibility is enforced by the second, ‘mapping term’, corresponding to pits whose
width and depth depend on T. Initially, the energy landscape appears to shelve
slightly and is lowest in regions with high city density. On lowering the temperature
a little, the mapping term becomes more important: it creates steeper pits around
cities. By this, the elastic net starts to stretch out. We next consider two potential,
nearly final states of a problem instance with 5 permanently fixed cities and 5
variable elastic net points, of which 4 are temporarily fixed. The energy landscape
of the remaining elastic net point is displayed. Figure 1 shows the case where 4 of
the 5 cities have already caught an elastic net point.
The landscape of the 5-th ring point exhibits a large pit situated above the only
non-visited city. If the point is not too far away from the non-visited city, it can still
be caught by it. This demonstrates that a too rapid lowering of the temperature may
lead to a non-valid solution. In Figure 2, an almost feasible final solution is shown,
where 3 net points coincide with 3 cities. A 4-th elastic net point is precisely in the
middle between the two close cities. Now, the mapping term only produces some
small pits. The elastic net term has become perceptible too. Hence, the remaining
elastic net point is most probably forced to the middle of its neighbors, making the
final state more or less equidistant, but not feasible! Thus, it is possible to end up
in a non-feasible solution if (a) the parameter T is lowered too rapidly¹ or if (b)
two close cities have caught the same net point.

Figure 1 The energy landscape, a non-feasible state.

Figure 2 The energy landscape, an almost feasible state.

4  Alternative  Elastic  Net  Algorithms 

In  order  to  use  a  correct  distance  measure  and  at  the  same  time,  to  get  rid  of  the 
equidistance  property,  we  adopt  a  linear  distance  measure  in  (7): 

$$F_{lin}(x) = \alpha_2 \sum_i |x^{i+1} - x^i| - \frac{\alpha_1}{\beta} \sum_p \ln \sum_j \exp\Big(-\frac{\beta^2}{2} |x_p - x^j|^2\Big). \qquad (15)$$

Applying  gradient  descent,  the  corresponding  motion  equations  are  found  [3].  A 
self-evident  analysis  shows  that,  like  in  the  original  ENA,  the  elastic  net  forces  try 
to  push  elastic  net  points  onto  a  straight  line.  There  is,  however,  an  important 
difference:  once  a  net  point  is  situated  in  any  point  on  the  straight  line  between  its 
neighboring  net  points,  it  no  longer  feels  an  elastic  net  force.  Equidistance  is  not 
pursued  anymore  and  the  net  points  have  more  freedom  to  move  towards  cities.  We 
therefore  conjecture  that  the  NENA  will  find  feasible  solutions  more  easily.  Since 
the  elastic  net  forces  are  normalized  by  the  new  algorithm,  a  tuning  problem  arises. 
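To solve this problem, all elastic net forces are multiplied by the same factor. The
final updating rule becomes:

$$\Delta x^i = \cdots \left( \frac{x^{i+1} - x^i}{|x^{i+1} - x^i|} + \frac{x^{i-1} - x^i}{|x^{i-1} - x^i|} \right) + \alpha_1 \sum_p \Lambda^p(i) (x_p - x^i).$$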


Finally, we merged the ENA and the NENA into a hybrid algorithm (HENA): the
algorithm starts using ENA (to get a balanced stretching out) and, after a certain
number of steps, switches to NENA (to try to guarantee feasibility).
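
The difference between the two elastic forces, and the switch used by the hybrid scheme, can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: `scale` stands in for the common factor applied to the normalized elastic forces, `switch_step` for the unspecified number of ENA steps before the switch, and all function names are hypothetical.

```python
import numpy as np


def mapping_force(x, cities, beta):
    """Sum over cities p of Lambda^p(i) * (x_p - x^i), the mapping term of (8)."""
    diff = cities[:, None, :] - x[None, :, :]            # shape (P, N, 2)
    logits = -0.5 * beta**2 * np.sum(diff**2, axis=2)    # shape (P, N)
    logits = logits - logits.max(axis=1, keepdims=True)  # stabilise exp
    lam = np.exp(logits)
    lam = lam / lam.sum(axis=1, keepdims=True)           # normalise over net points
    return np.einsum("pi,pid->id", lam, diff)


def quadratic_elastic_force(x):
    """ENA force from the quadratic measure: x^{i+1} - 2 x^i + x^{i-1}."""
    return np.roll(x, -1, axis=0) - 2.0 * x + np.roll(x, 1, axis=0)


def unit_elastic_force(x, eps=1e-12):
    """NENA force from the linear measure: sum of unit vectors to both neighbours."""
    fwd = np.roll(x, -1, axis=0) - x
    bwd = np.roll(x, 1, axis=0) - x
    fwd = fwd / np.maximum(np.linalg.norm(fwd, axis=1, keepdims=True), eps)
    bwd = bwd / np.maximum(np.linalg.norm(bwd, axis=1, keepdims=True), eps)
    return fwd + bwd


def hena_step(x, cities, beta, step, switch_step,
              alpha1=1.0, alpha2=1.0, scale=1.0):
    """One hybrid update: ENA before `switch_step`, NENA afterwards."""
    if step < switch_step:
        elastic = quadratic_elastic_force(x)
    else:
        elastic = scale * unit_elastic_force(x)  # `scale` is an assumed tuning factor
    return x + (alpha2 / beta) * elastic + alpha1 * mapping_force(x, cities, beta)
```

A net point lying anywhere on the straight line between its two neighbours receives zero force from `unit_elastic_force`, whereas `quadratic_elastic_force` still pulls it toward the midpoint; this is the loss of the equidistance property described above.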


¹ In optimal annealing [1], the temperature is decreased carefully to escape from local minima.
Instead, here this is done to end up in a local (i.e., a constrained) minimum.
5  Experiments

We started using the 5-city configuration of section 3. Using 5 up to 12 elastic net
points, the ENA produced only non-feasible solutions. Using 15 elastic net points,
the optimal feasible solution is always found. Using 5 elastic net points, the NENA
occasionally produced the optimal solution. A gradual increase of the number of
elastic net points results in a rise of the percentage of optimal solutions found.
Using only 10 elastic net points, we obtained a 100% score. Testing a 15-city
problem, we had similar experiences. However, the picture started to change
with 30-city problem instances. As a rule, both algorithms are equally disposed
to find a valid solution, but the quality of the solutions of the original ENA is
generally better. Trying even larger problem instances, the NENA more frequently
found a non-valid solution: inspection shows a strong lumping effect of net points
around cities and sometimes a certain city is completely left out. At this point, the
hybrid approach of HENA comes to mind. Up to 100 cities, we were unable to find
parameters which yield better solutions than the original ENA.

6  Conclusions  and  Outlook 

Elastic neural networks are dynamic penalty methods, therefore always having a
tuning problem. Contrary to simulated annealing, the network should end up in
a local, constrained minimum. The ENA may come up with a non-valid
solution if two cities are close to each other. To guarantee feasibility more easily, we
implemented a new algorithm with a linear distance measure. Its success is
limited to small problem instances, showing that the quadratic distance measure is
an essential ingredient of the original ENA. Trying a hybrid algorithm, we did not
find parameters which yield a better performance. In future research, an alternative
for HENA can be considered by realizing a gradual switch from the ENA to the
NENA. Likewise, other formulations of penalty terms can be tested.
This paper deals with optimal learning and provides a unified viewpoint of most significant results
in the field. The focus is on the problem of local minima in the cost function that is likely to affect
more or less any learning algorithm. We give some intriguing links between optimal learning and
the computational complexity of loading problems. We exhibit a computational model such that
the solution of all loading problems giving rise to unimodal error functions require the same time,
thus suggesting that they belong to the same computational class.

Keywords: Backpropagation, computational complexity, optimal learning, premature saturation,
spurious and structural local minima, terminal attractor.

1  Learning  as  Optimisation 

Supervised  learning  in  multilayered  networks  (MLNs)  can  be  accomplished  thanks 
to Backpropagation (BP), which is used to minimise pattern misclassifications by
means  of  gradient  descent  for  a  particular  nonlinear  least  squares  fitting  problem. 
Unfortunately,  BP  is  likely  to  be  trapped  in  local  minima  and  indeed  many  examples 
of  local  extremes  have  been  reported  in  the  literature. 

The presence of local minima derives essentially from two different reasons. First,
they may arise because of an unsuitable joint choice of the functions which define
the network dynamics and the error function. Second, local minima may be
inherently related to the structure of the problem at hand. In [5], these two cases have
been referred to as spurious and structural local minima, respectively. Problems of
sub-optimal solutions may also arise when learning with high initial weights, as a
sort of premature neuron saturation arises, which is strictly related to the neuron
fan-in. An interesting way of facing this problem is to use the “relative cross-entropy
metric” [10], for which the erroneous saturation of the output neurons does not lead
to plateaux, but to very high values of the cost. When using the cross-entropy
metric, the repulsion from such configurations is much more effective, and underflow
errors are likely to be avoided.

There have also been attempts to provide theoretical conditions aimed at
guaranteeing local minima free error surfaces. So far, however, only some sufficient
conditions have been identified that give rise to unimodal error surfaces. Examples are
the case of pyramidal networks [8], commonly used in pattern recognition, radial
basis function networks [2], and non-linear autoassociators [3]. The identification of
similar conditions ensures global optimisation just by using simple gradient descent.
Instead of looking for local algorithms like gradient descent, techniques that
guarantee global optimisation may be explored. Of course, one of the main problems
to face is that most interesting tasks give rise to the optimisation of functions
with even several thousand variables. This makes it very unlikely that most classic
approaches [11] can be directly and successfully applied. Instead, the proposal of
successful algorithms has to face effectively the curse of dimensionality typical of
most interesting practical problems.

Statistical  training  methods  have  been  previously  proposed  in  order  to  alleviate  the 
local  convergence  problem.  These  methods  introduce  noise  to  connection  weights 
during training, but suffer from extremely slow convergence due to their
probabilistic nature.

Several numerical algorithms for global optimisation have also been presented, in
which BP is revisited from the viewpoint of dynamical system theory. Barhen et
al. [1] have proposed the TRUST algorithm (for Terminal Repeller Unconstrained
Subenergy Tunneling) that formulates global optimisation as the solution of a
system of deterministic differential equations, where $E(W)$ is the function to be
optimised, while the connection weights are the states of the system. The dynamics
used is obtained by applying gradient descent to a modified cost which
transforms each encountered local minimum into a maximum, so that the gradient
descent can escape from it to a lower valley. A related algorithm, called Magic
Hair-Brushing, has been proposed in [6]. The system dynamics is now modified
through a deformation of the gradient field for eliminating the local minima, while
preserving the global structure of the function. All these algorithms exhibit a good
performance in many practical cases but, unfortunately, their optimal convergence
is not formally guaranteed, unless starting from a “good” initial point.

2  The  Class  of  Unimodal  Loading  Problems 

Most experiments with multilayer perceptrons and BP are performed in a sort of
magic atmosphere where data are properly supplied to the network, which begins
learning without knowing whether or not the experiment will be successful in terms
of either optimal convergence or generalisation. A trial and error scheme is
usually employed, aimed at adjusting the architecture in subsequent experiments
so as to meet the desired requirements. To some extent, this way of performing
experiments is inherently plagued by the suspicion that the numerical optimisation
algorithm used might fail. Moreover, though optimal learning may be attained
with networks having a growing number of hidden neurons [14], the generalisation
to new examples is not guaranteed. The intuitive feeling that, in order to obtain a
good convergence behaviour, generalisation must be sacrificed, may be effectively
formalised in a sort of “uncertainty principle of learning” in which the variables
representing optimal convergence and generalisation are like conjugate variables in
Quantum Mechanics [7]. These potential sources of failure of learning algorithms
give rise to a sort of suspiciousness that turns out to be the unpleasant companion
of every experiment. This seems to be interwoven with the ambitious task of
learning too general functions.

Let us focus on the complexity issues related to the loading of the weights
independently of the consequent generalisation to new examples. This makes sense once a
consistent formulation of the learning problem in terms of both the chosen examples
and the neural architecture has been provided. We address the problem of establishing
the computational requirements of special cases in which the loading of the weights
can be expressed in terms of optimisation of unimodal error functions.
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2.1  Canonical  Form  of  Gradient  Descent  Learning 

Let us consider the following learning equation:

$$\frac{dW}{dt} = -\gamma \nabla_W E = f(t, W), \qquad (1)$$

where $E$ is the cost function and $W \in \mathbb{R}^m$ is the weight vector. Let us choose
$\gamma = \psi(E) / \|\nabla_W E\|^2$, with $\psi(E)$ a non-negative continuous function of $E$.
Based on this choice of the learning rate, the dynamics of the error function becomes

$$\frac{dE}{dt} = (\nabla_W E)^T \frac{dW}{dt} = (\nabla_W E)^T \left( -\psi(E) \frac{\nabla_W E}{\|\nabla_W E\|^2} \right) = -\psi(E), \qquad (2)$$

which makes the cost function decrease continuously to zero. Those configurations
for which $\nabla_W E = 0$ are singular points that attract the learning trajectory [4].
Special cases of this reduction to a canonical structure, where the learning is forced
by the function $\psi$ and is independent of the problem at hand, have been explored
in the literature. White [13] has suggested introducing a varying learning rate so
that the error dynamics evolves following the equation $dE/dt = -\psi(E) = -\alpha E$, $\alpha > 0$,
whose solution is a decaying exponential, so that reaching the $E = 0$ attractor will
theoretically take infinite time. In practice, this may not necessarily be a problem,
as the attractor may be approached sufficiently closely in a reasonable amount of time,
even if, for ill-conditioned systems, it can still be prohibitive to reach a satisfactory
solution. Unfortunately, feedforward neural networks often result in dynamical
systems that are ill-conditioned or mathematically stiff, and thus the convergence
is generally very slow.

In [12] the canonical reduction of equation (2) is based on choosing $\psi(E) = E^k$,
$0 < k < 1$, which leads to an error dynamics based on the differential equation
$dE/dt = -E^k$, having a singularity at $E = 0$ that violates the Lipschitz condition. If
$E_0 > 0$ is the initial value of the cost, then the closed form solution is

$$E(t) = \left( E_0^{1-k} - (1-k)\,t \right)^{1/(1-k)}, \qquad t < t_e,$$

where $t_e = E_0^{1-k}/(1-k)$ (Fig. 1a). In the finite time $t_e$ the
transient beginning from $E(0) = E_0$ reaches the equilibrium point $E = 0$, which is
a "terminal attractor".
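
The finite-time behaviour can be checked numerically. The following is a minimal Python sketch, not taken from [12]: it evaluates the closed-form solution of $dE/dt = -E^k$ and compares it with a plain forward-Euler integration. The function names and the step count are illustrative assumptions.

```python
def terminal_time(e0: float, k: float) -> float:
    """Finite time t_e at which E(t) reaches zero for dE/dt = -E**k, 0 < k < 1."""
    return e0 ** (1.0 - k) / (1.0 - k)


def error_at(t: float, e0: float, k: float) -> float:
    """Closed-form solution E(t); it stays at zero once t >= t_e."""
    base = e0 ** (1.0 - k) - (1.0 - k) * t
    return base ** (1.0 / (1.0 - k)) if base > 0.0 else 0.0


def euler_error(t_end: float, e0: float, k: float, steps: int = 200_000) -> float:
    """Forward-Euler integration of dE/dt = -E**k, clipped at zero."""
    dt = t_end / steps
    e = e0
    for _ in range(steps):
        e = max(e - dt * e ** k, 0.0)
    return e
```

For instance, with $E_0 = 2$ and $k = 0.5$ the terminal time is $2\sqrt{2}$, whereas the exponential decay obtained with $\psi(E) = \alpha E$ never reaches zero exactly.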

In this paper, we are interested in finding terminal attractors and, particularly,
in minimising the time $t_e$ required to approach the optimal solution. The choice
$\psi(E) = \eta$ fulfills our needs. Consequently $t_e = E_0/\eta$ and, in particular, when
selecting $\eta = E_0/\sigma$ the terminal attractor is approached for $t_e = \sigma$ (Fig. 1b),
independently of the problem at hand, while the corresponding weight updating
equation becomes

$$\frac{dW}{dt} = -\frac{E_0}{\sigma} \frac{\nabla E}{\|\nabla E\|^2}. \qquad (3)$$

This way of forcing the dynamics establishes intriguing links between the
concept of unimodal problems and their computational complexity. In fact, reaching
attractors in finite time is not only useful from a numerical point of view but, in
the light of the considerations on the canonical equation (2), is also interesting for the
relationships that can be established between different problems giving rise to
cost functions free of local minima.
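
As an illustration of the constant-rate dynamics, the sketch below integrates the normalised gradient flow $dW/dt = -(E_0/\sigma)\,\nabla E / \|\nabla E\|^2$ with forward Euler on a hypothetical quadratic cost $E(W) = \frac{1}{2}\sum_i a_i w_i^2$. The cost, the coefficients `a` and the step count are assumptions made only for this example; in continuous time the cost should then fall linearly, $E(t) = E_0 (1 - t/\sigma)$.

```python
def cost(w, a):
    """Hypothetical unimodal cost E(W) = 0.5 * sum_i a_i * w_i**2."""
    return 0.5 * sum(ai * wi * wi for ai, wi in zip(a, w))


def grad(w, a):
    """Gradient of the quadratic cost above."""
    return [ai * wi for ai, wi in zip(a, w)]


def terminal_flow(w0, a, sigma, t_end, steps):
    """Forward-Euler integration of dW/dt = -(E0/sigma) * grad E / ||grad E||^2."""
    w = list(w0)
    eta = cost(w, a) / sigma          # constant forcing term eta = E0 / sigma
    dt = t_end / steps
    for _ in range(steps):
        g = grad(w, a)
        g2 = sum(gi * gi for gi in g)
        if g2 == 0.0:                 # stationary point: the flow is undefined there
            break
        w = [wi - dt * eta * gi / g2 for wi, gi in zip(w, g)]
    return w
```

Note that the update divides by $\|\nabla E\|^2$, so a discrete integrator becomes delicate as the minimum is approached; the sketch is therefore only meant to be run for $t$ safely below $\sigma$.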

2.2  Computational  analyses 

Let  us  introduce  the  following  classes  of  loading  problems  [9] . 
Figure 1  Terminal attraction using (a) $\psi(E) = E^k$, $0 < k < 1$, and (b) $\psi(E) = \eta$.


Definition 1 A loading problem $P$ is unimodal, $P \in \mathcal{UP}$, provided that there exists
an associated unimodal error function $E(P, W)$ whose optimisation represents the
solution of $P$.

Note that a given loading problem can be approached in the framework of optimisation
using different error functions. For example, the loading of the weights in
a multilayer perceptron using linearly-separable patterns may lead to sub-optimal
solutions when choosing error functions where the targets are different from the
asymptotic values of the squashing functions. Nevertheless, it is always possible
to get rid of these spurious local minima and provide a formulation based on an
error function free of local minima.

In order to evaluate the computational cost for learning problems belonging to $\mathcal{UP}$
it is convenient to refer to the parallel computational model offered by differential
equation (1). We assume that there exists a continuous system implementing this
differential equation and then consider the following computational class.

Definition 2 Let us consider the class of loading problems $P$ for which there exists
a formulation according to the differential equation (1) such that, $\forall \tau > 0$, the loading
of the weights ends in a finite time $t_e$ with $t_e < \tau$. This class is referred to as the class
of finite time loading problems and is denoted by $\mathcal{FT}$.

Because of the previous analysis on gradient descent the following result holds.

Theorem 3 $\mathcal{UP} \subset \mathcal{FT}$.

Proof If $P \in \mathcal{UP}$ then one can always formulate the loading according to differential
equation (3) and the gradient descent is guaranteed not to get stuck in local minima.
Because of equation (2) the learning process ends for $t_e = \sigma$. Hence, $\forall \tau > 0$, if we
choose $\sigma < \tau$ we conclude that $P \in \mathcal{FT}$. □

This theoretical result should not be overvalued, since it is based on a computational
model that does not take into account problems due to limited energy. When choosing $\tau$
arbitrarily small, the slope of the energy in Fig. 1b in fact goes to infinity.

Figure 2  The class of unimodal learning problems can be learned in constant time.

One may wonder whether problems can be found in $\mathcal{FT}$ that are not in $\mathcal{UP}$ (see
Fig. 2). This does not seem easy to establish and is an open problem that we think
deserves further attention.

3  Conclusions 

The  presence  of  local  minima  does  not  necessarily  imply  that  a  learning  algorithm 
will  fail  to  discover  an  optimal  solution,  but  we  can  think  of  their  presence  as  a 
boundary  beyond  which  troubles  for  any  learning  technique  are  likely  to  begin. 

In this paper we have proposed a brief review of results dealing with optimal learning,
and we have discussed problems of sub-optimal learning. Most importantly,
when referring to a continuous computational model, we have shown that there are
some intriguing links between computational complexity and the absence of local
minima. Basically, all loading problems that can be formulated as the optimisation
of unimodal functions are proven to belong to a unique computational class. Note
that this class is defined on the basis of computational requirements and, therefore,
seems to be of interest independently of the neural network context in which it has
been formulated.

We are confident that these theoretical results open the doors for more thorough
analyses involving discrete computations, that could shed light on the computational
complexity based on ordinary models of Computer Science.
1  Introduction

In  this  paper  we  mainly  discuss  the  mapping  of  a  linear  tree  classifier  (LTC)  onto  a 
feedforward  neural  net  classifier  (NNC)  with  one  hidden  layer.  According  to  Park 
[9]  such  a  mapping  results  in  a  faster  convergence  of  the  neural  net  and  in  avoiding 
local  minima  in  network  training.  In  general  these  mappings  are  also  interesting 
because  they  determine  an  appropriate  architecture  of  the  neural  net.  The  LTC 
used  here  is  a  hierarchical  classifier  that  employs  linear  functions  at  each  node 
in  the  tree.  For  the  construction  of  decision  trees  we  refer  to  [10,  5,  12].  Several 
authors  [11,  4,  9]  discuss  the  mapping  of  an  LTC  onto  a  feedforward  net  with  one 
or two hidden layers, see also [3, 2]. A discussion of a mapping onto a net with two
hidden layers can be found in Sethi [11] and Ivanova & Kubat [4]. A mapping onto a
net with one hidden layer is discussed in Park [9]. In his approach the mapping is
based on representing the convex regions induced by an LTC by linear membership
functions. However, in Park [9] no explicit expression for the coefficients of the
membership functions is given. These coefficients depend on a parameter p that
in his paper has to be supplied by the user. In section 2 we show that in general
it is not possible to find linear membership functions that represent the convex
regions induced by an LTC. It is however possible to find subregions that can be
represented by linear membership functions. We derive explicit expressions for the
aforementioned parameter p in section 3. This makes it possible to control the
approximation of the convex regions by membership functions and therefore
the initialisation of the neural net. In section 4 we also briefly discuss the use of
LVQ-networks [6, 7] for such an initialisation.

2  Non-existence  of  Linear  Membership  Functions 

Suppose we are given a multivariate decision tree $(\mathcal{N}, \mathcal{L}, \mathcal{D})$. In this notation, $\mathcal{N}$ is
the set of nodes of the tree, $\mathcal{L}$ is the set of leaves of the tree and $\mathcal{D}$ is the set of
linear functions $d_k : \mathbb{R}^n \to \mathbb{R}$, $k \in \mathcal{N}$. At any node $k$ of the tree, the linear function
$d_k$ is used to decide which branch to take. Specifically, we go left if $d_k(x) > 0$, right
if $d_k(x) < 0$, see figure 1. A decision tree induces a partitioning of $\mathbb{R}^n$: each leaf
$\ell$ corresponds with a convex region $R_\ell$, which consists of all points $x \in \mathbb{R}^n$ that
get assigned to leaf $\ell$ by the decision tree. For example, region $R_5$ consists of all
$x \in \mathbb{R}^n$ with $d_1(x) < 0$ and $d_3(x) < 0$.
Figure 1  The convex regions induced by a classification tree.


We will now discuss the idea of linear membership functions to represent the convex
regions induced by an LTC, and we show that such functions do not exist in
general.

In [9] the following 'theorem' is given without proof, though supplied with heuristic
reasoning for its plausibility, see also equation 2 in the next section:

Conjecture (Park [9]) For every decision tree $(\mathcal{N}, \mathcal{L}, \mathcal{D})$ there exists a set of linear
membership functions $M = \{m_\ell, \ell \in \mathcal{L}\}$, such that for any $\ell, \ell' \in \mathcal{L}$, with $\ell \neq \ell'$:

$$m_\ell(x) > m_{\ell'}(x), \quad \forall x \in R_\ell. \qquad (1)$$

According  to  [9]  the  coefficients  of  a  linear  membership  function  can  be  used  to 
initialise  the  weights  of  the  connections  of  a  hidden  unit  with  the  input  layer. 
For  example,  the  decision  tree  in  figure  1  can  be  mapped  onto  a  one-hidden-layer 
network  with  3  inputs,  5  hidden  units  and  2  outputs.  The  weights  in  the  input-to- 
hidden  layer  are  chosen  according  to  the  membership  functions. 

In [2] we have presented a minimal counterexample to Park's conjecture, and we
also argued that a mapping is only possible if the decision boundaries are parallel. In
the next theorem we show that neither linear nor quadratic membership functions
that represent the convex regions induced by an LTC can exist.

Theorem 1 Let $(\mathcal{N}, \mathcal{L}, \mathcal{D})$ be a decision tree with at least two non-parallel decision
boundaries. Then the convex regions induced by this tree cannot be represented by
a set of quadratic polynomials.

Proof Let $R_1$, $R_2$ and $R_3$ be regions induced by a decision tree such that the
regions $R_1$ and $R_2 \cup R_3$ are separated by the hyperplane $d_1(x) = 0$, $x \in \mathbb{R}^n$. We
assume that $d_1(x) > 0$ on $R_1$ and $d_1(x) < 0$ on $R_2 \cup R_3$. Similarly, $R_2$ and $R_3$ are
separated by $d_2(x) = 0$, such that $d_2(x) > 0$ on $R_2$ and $d_2(x) < 0$ on $R_3$. Note
that such regions will always be induced by a subtree of a decision tree, unless
all decision boundaries are parallel. Let $m_1$, $m_2$ and $m_3$ respectively denote the
membership functions of $R_1$, $R_2$ and $R_3$. By definition $m_1 = 0$ on the hyperplane
$d_1(x) = 0$. Similarly, $m_2 = 0$ on $d_2(x) = 0$. (Note that we actually know only that
$m_2$ is zero on half of the hyperplane $d_2(x) = 0$. However, using a simple result
from algebraic geometry it follows that $m_2$ must be zero on the whole hyperplane
$d_2(x) = 0$.)

Now, let $D_{12} = m_1 - m_2$. Then $D_{12}$ is zero on $d_1(x) = 0$, because $D_{12} > 0$ on
$R_1$ and $D_{12} < 0$ on $R_2 \cup R_3$. As a consequence of Hilbert's Nullstellensatz $d_1$ is a
factor of $D_{12}$. Therefore, there exists a polynomial function $e$ such that $D_{12} = d_1 e$.
Since $D_{12}$ is at most quadratic by assumption, we conclude that $e$ is a constant or
a linear function. However, since both $d_1 e$ and $d_1$ are positive on $R_1$ and negative
on $R_2$, the function $e$ is positive on $R_1 \cup R_2$. Since the degree of $e$ is $\leq 1$, $e$ must be
a positive constant. Similarly, we have $D_{13} = d_1 f$, where $f$ is a positive constant.
Therefore $D_{23} = d_1 (f - e)$. This contradicts the fact that $D_{23}$ is zero on $d_2(x) = 0$.
□

Remark  In  [2]  it  is  shown  that  under  the  conditions  of  Theorem  1,  the  membership 
functions  mi  can  be  represented  by  multivariate  polynomials,  albeit  of  degree  >  5. 

3  An  Approximated  Mapping  of  a  Decision  Tree  onto  a 
One-hidden-layer  Neural  Network 

We  will  show  in  this  section  that  the  difficulties  encountered  in  the  preceding  section 
may  be  circumvented  by  requiring  that  the  points  we  consider  are  not  too  close  to 
the  hyperplanes  associated  with  the  decision  tree. 

Let $(\mathcal{N}, \mathcal{L}, \mathcal{D})$ be a decision tree. We will restrict the regions $R_\ell$ by assuming that
$\forall k \in \mathcal{N} : 0 < \epsilon < |d_k(x)| < E$, where $\epsilon$ and $E$ are positive constants. The set of
points in $R_\ell$ satisfying this condition will be denoted by $S_\ell$. Hence, $S_\ell$ is a convex
subregion of $R_\ell$. Note also that $R_\ell$ can be approximated by $S_\ell$ with arbitrary
precision, by varying the constants $\epsilon$ and $E$.

In [9] Park considers the following set of linear membership functions:

$$m_\ell(x) = \sum_{k \in P_\ell} s_{\ell k}\, c_k\, d_k(x), \qquad (2)$$

where $P_\ell$ is the set of nodes on the path from the root to the leaf $\ell$. The constants
$s_{\ell k}$ are defined such that:

$$s_{\ell k}\, d_k(x) > 0. \qquad (3)$$
The constants $c_k > 0$ are determined experimentally in [9]. Here we will derive an
explicit expression for these constants. Since, as we have shown above, in general
these constants cannot exist if $x \in R_\ell$, $\ell \in \mathcal{L}$, we will now assume that $x \in S_\ell$.

Theorem 2 Let $(\mathcal{N}, \mathcal{L}, \mathcal{D})$ be a decision tree. Then there exists a set of linear
functions $M = \{m_\ell, \ell \in \mathcal{L}\}$, such that for any $\ell, \ell' \in \mathcal{L}$, with $\ell \neq \ell'$:

$$\forall x \in S_\ell : m_\ell(x) > m_{\ell'}(x). \qquad (4)$$


Proof Let $T$ be the set of terminal nodes of the tree. An internal node $t$ is called
terminal if both children of $t$ are leaves. Further, if $n_1$ and $n_2$ are two nodes, then
we write $n_1 < n_2$ if $n_1$ is an ancestor of $n_2$. Suppose that $n_a \notin T$, and $\ell, \ell'$ are two
leaves such that $n_a$ is the last common ancestor of $\ell$ and $\ell'$.

We decompose the function $m_{\ell'}$ as follows:

$$m_{\ell'}(x) = \sum_{n_i \leq n_a} s_{\ell' i}\, c_i\, d_i(x) + \sum_{n_j > n_a} s_{\ell' j}\, c_j\, d_j(x),$$

where $n_i \in P_\ell$ and $n_j \in P_{\ell'}$. By applying (3) to the node
$n_a$, it can be seen that $s_{\ell a} = -s_{\ell' a}$. From this we conclude
$m_\ell(x) - m_{\ell'}(x) = 2 s_{\ell a} c_a d_a(x) + \Sigma_1 - \Sigma_2$, where
$\Sigma_1 = \sum_{n_i > n_a} s_{\ell i}\, c_i\, d_i(x)$, with $n_i \in P_\ell$, and
$\Sigma_2 = \sum_{n_j > n_a} s_{\ell' j}\, c_j\, d_j(x)$, with $n_j \in P_{\ell'}$. To assure that (4) holds, we require that
$(\Sigma_2 - \Sigma_1)/(2 s_{\ell a} d_a(x)) < c_a$. Since $\Sigma_1$ is positive we can satisfy the last condition by taking
$\Sigma_2/(2 s_{\ell a} d_a(x)) < c_a$, or $\frac{E}{2\epsilon} \sum_{n_j > n_a} c_j < c_a$. This yields the following sufficient
condition for the constants $c_a$:

$$c_a > \frac{E}{2\epsilon} \max_{\ell \in \mathcal{L}_a} \sum_{n_j > n_a} c_j, \qquad (5)$$

where $n_j \in P_\ell$ and $\mathcal{L}_a$ is the set of leaves of the subtree with root $n_a$. Condition (5)
implies: $c_a > \frac{E}{2\epsilon} \sum_{n_j > n_a} c_j$, where $n_j \in P$ and $P$ is the longest possible path from
node $n_a$ to some leaf. From condition (5) it also follows that the constants $c_a$ are
determined up to a constant factor. It is easy to see that the constants can be
determined recursively by choosing positive values $c_t$ for the set of terminal nodes
$t \in T$. □
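
The recursive construction in the proof can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the three-node tree in the plane, the safety factor 1.01, the sampling box and all function names are hypothetical, and the sufficient condition is taken in the form $c_a > \frac{E}{2\epsilon}\max\sum c_j$ that follows from the bounds $\epsilon < |d_k(x)| < E$ used in the proof.

```python
import random

# Internal node: (name, (w, b), left, right) with d(x) = w.x + b; a leaf is a string.
# Hypothetical three-node tree in the plane, used only for illustration.
TREE = ("n1", ((1.0, 0.0), 0.0),
        ("n2", ((0.0, 1.0), 0.0), "A", "B"),
        ("n3", ((1.0, 1.0), 0.5), "C", "D"))


def d(node, x):
    """Linear decision function of an internal node."""
    w, b = node[1]
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def constants(node, ratio, out):
    """Fill out[name] = c_k bottom-up; return the largest path sum of c from node down."""
    if isinstance(node, str):
        return 0.0
    below = max(constants(node[2], ratio, out), constants(node[3], ratio, out))
    # Terminal nodes get an arbitrary positive value; others satisfy c_a > ratio * below.
    c = 1.0 if below == 0.0 else 1.01 * ratio * below
    out[node[0]] = c
    return c + below


def paths(node, prefix=()):
    """Yield (leaf, [(internal node, sign), ...]); sign is +1 for left, -1 for right."""
    if isinstance(node, str):
        yield node, list(prefix)
        return
    yield from paths(node[2], prefix + ((node, +1),))
    yield from paths(node[3], prefix + ((node, -1),))


def internal_nodes(node):
    if isinstance(node, str):
        return []
    return [node] + internal_nodes(node[2]) + internal_nodes(node[3])


def classify(node, x):
    """Leaf assigned to x by the tree: left if d(x) > 0, right otherwise."""
    while not isinstance(node, str):
        node = node[2] if d(node, x) > 0 else node[3]
    return node


def memberships(x, c):
    """Linear membership m_l(x) = sum over the path of s_lk * c_k * d_k(x)."""
    return {leaf: sum(s * c[n[0]] * d(n, x) for n, s in path)
            for leaf, path in paths(TREE)}


def check(eps=0.1, big_e=5.0, trials=2000, seed=0):
    """True if the largest membership agrees with the tree on all sampled points."""
    rng = random.Random(seed)
    c = {}
    constants(TREE, big_e / (2.0 * eps), c)
    nodes = internal_nodes(TREE)
    tested = 0
    for _ in range(trials):
        x = (rng.uniform(-2.0, 2.0), rng.uniform(-2.0, 2.0))
        if not all(eps < abs(d(n, x)) < big_e for n in nodes):
            continue  # x lies outside every restricted region S_l
        m = memberships(x, c)
        if max(m, key=m.get) != classify(TREE, x):
            return False
        tested += 1
    return tested > 0
```

Points closer than $\epsilon$ to a hyperplane are skipped on purpose, since the guarantee of Theorem 2 only covers the subregions $S_\ell$.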


Theorem  3  Let  p  =  1/(1  +  g).  Then  mi{x)  =  Skip^^^^~^^^Uk{x)  are  linear

functions  for  which  (4)  holds.  Here  6{k)  denotes  the  length  of  the  longest  possible 
path  from  node  Uf.  to  a  terminal  node. 


Proof  Using  the  formula  for  the  partial  sum  of  a  geometric  series  this  result  can 
be  obtained  straightforwardly,  see  [2].  □ 

Another expression for the constants $c_a$ can be obtained by using the fact that a
decision tree $(\mathcal{N}, \mathcal{L}, \mathcal{D})$ in practical situations is derived from a finite set of examples
$V \subset \mathbb{R}^n$. See [2].

4  Another  Initialisation  Method 

In this section we consider another well-known classifier: Learning Vector Quantisation
(LVQ) networks [6, 7]. This method can be used to solve a classification
problem with $m$ classes and data vectors $x \in \mathbb{R}^n$. It is known that the LVQ-network
induces a so-called Voronoi tessellation of the input space, see [6] chapter 9. Training
of an LVQ-network yields prototype vectors $w_j \in \mathbb{R}^n$, $j = 1, \ldots, m$, such that an
input vector $x$ belongs to class $j$ iff the distance to $w_j$ is smallest:

$$\forall i \neq j : \|w_j - x\| < \|w_i - x\| \Rightarrow x \in R_j.$$

It is easy to see that this is equivalent with

$$w_j^T x - \tfrac{1}{2} w_j^T w_j > w_i^T x - \tfrac{1}{2} w_i^T w_i.$$

Now define the linear membership function $m_i$ as:

$$m_i(x) = w_i^T x - \tfrac{1}{2} w_i^T w_i.$$

Then

$$x \in R_i \iff m_i(x) > m_j(x), \quad \forall j \neq i.$$
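
The equivalence between the nearest-prototype rule and the largest linear membership can be illustrated with a few lines of Python. This is only a sketch: the function names are illustrative, and the prototypes would in practice come from a trained LVQ-network rather than being supplied by hand.

```python
def nearest_prototype(x, prototypes):
    """Index of the prototype with the smallest squared Euclidean distance to x."""
    dists = [sum((wi - xi) ** 2 for wi, xi in zip(w, x)) for w in prototypes]
    return dists.index(min(dists))


def membership(x, w):
    """Linear membership m(x) = w^T x - (1/2) w^T w for prototype w."""
    return sum(wi * xi for wi, xi in zip(w, x)) - 0.5 * sum(wi * wi for wi in w)


def largest_membership(x, prototypes):
    """Index of the prototype whose linear membership function is largest at x."""
    scores = [membership(x, w) for w in prototypes]
    return scores.index(max(scores))
```

Both rules agree because $\|w - x\|^2 = x^T x - 2\,(w^T x - \frac{1}{2} w^T w)$ and the term $x^T x$ is common to all prototypes.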


Since an LVQ-network is a good classifier and can be trained relatively fast, it is
a good alternative for the initialisation of a neural net using linear membership
functions. In [2] we show that an LVQ-network cannot induce the convex regions
induced by an LTC.

Discussion: We have proven that linear (or quadratic) membership functions representing
the convex regions of a linear decision tree in general do not exist. However,
we give explicit formulae for the approximation of such functions. This allows
us to control the degree of approximation. Currently, we are investigating how to
determine an appropriate approximation in a given application. Furthermore, we
discussed the use of another simple classifier, namely an LVQ-network, for initialising
neural nets, see also [2].
Most conventional techniques for estimating conditional probability densities are inappropriate for
applications involving periodic variables. In this paper we introduce three related techniques for
tackling such problems, and test them using synthetic data. We then apply them to the problem
of extracting the distribution of wind vector directions from radar scatterometer data.

1  Introduction 

Many applications of neural networks can be formulated in terms of a multivariate
non-linear mapping from an input vector $x$ to a target vector $t$: a conventional
network approximates the regression (i.e. average) of $t$ given $x$. But for mappings
which are multi-valued, the average of two solutions is not necessarily a valid solution.
This problem can be resolved by estimating the conditional probability $p(t|x)$.
In this paper, we consider techniques for modelling the conditional distribution of
a periodic variable.


2  Density  Estimation  for  Periodic  Variables 

A commonly used technique for unconditional density estimation is based on mixture
models of the form

$$p(t) = \sum_{i=1}^{m} \alpha_i \phi_i(t) \qquad (1)$$

where $\alpha_i$ are called mixing coefficients, and the component functions, or kernels,
$\phi_i(t)$ are typically chosen to be Gaussians [7, 9]. In order to turn this into a model
for conditional density estimation, we simply make the coefficients and adaptive
parameters into functions of the input vector $x$ using a neural network which takes
$x$ as input [4, 1, 5]. We propose three methods for modelling the conditional density.

2.1  Mixtures  of  Wrapped  Normal  Densities 

The first technique transforms $\chi \in \mathbb{R}$ to the periodic variable $\theta \in [0, 2\pi)$ by
$\theta = \chi \bmod 2\pi$. The transformation maps density functions $p$ with domain $\mathbb{R}$ into density
functions $\tilde{p}$ with domain $[0, 2\pi)$ as follows:

$$\tilde{p}(\theta|x) = \sum_{N=-\infty}^{\infty} p(\theta + N 2\pi \,|\, x) \qquad (2)$$

This periodic function is normalized on the interval $[0, 2\pi)$, since

$$\int_0^{2\pi} \tilde{p}(\theta|x)\, d\theta = \int_{-\infty}^{\infty} p(\chi|x)\, d\chi = 1 \qquad (3)$$

Here we shall restrict attention to Gaussian $\phi$:

$$\phi_i(t|x) = \frac{1}{(2\pi)^{c/2} \sigma_i(x)^c} \exp\left\{ -\frac{\|t - \mu_i(x)\|^2}{2\sigma_i(x)^2} \right\} \qquad (4)$$

where $t \in \mathbb{R}^c$.

A mixture model with kernels as in equation (4) can approximate any density
function to arbitrary accuracy with suitable choice of parameters [7, 9]. We use a
standard multi-layer perceptron with a single hidden layer of sigmoidal units and
an output layer of linear units. It is necessary that the mixing coefficients $\alpha_i(x)$
satisfy the constraints

$$\sum_{i=1}^{m} \alpha_i(x) = 1, \qquad 0 \leq \alpha_i(x) \leq 1. \qquad (5)$$

This can be achieved by choosing the $\alpha_i(x)$ to be related to the corresponding
network outputs by a normalized exponential, or softmax function [4]. The centres
$\mu_i$ of the kernel functions are represented directly by the network outputs; this was
motivated by the corresponding choice of an uninformative Bayesian prior [4]. The
standard deviations $\sigma_i(x)$ of the kernel functions represent scale parameters and so
it is convenient to represent them in terms of the exponentials of the corresponding
network outputs. This ensures that $\sigma_i(x) > 0$ and discourages $\sigma_i(x)$ from tending
to 0. Again, it corresponds to an uninformative prior. To obtain the parameters of
the model we minimize an error function $E$ given by the negative logarithm of the
likelihood function, using conjugate gradients. (The maximum likelihood approach
underestimates the variance of a distribution in regions of low data density [1]. For
our application, this effect will be small since the number of data points is large.)
In practice, we must restrict the value of $N$. We have taken the summation over 7
complete periods of $2\pi$. Since the component Gaussians have exponentially decaying
tails, this introduces negligible error provided the network is initialized so that the
kernels have their centres close to 0.
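
A minimal Python sketch of this parametrisation is given below for a one-dimensional periodic target. It is an illustration under assumptions rather than the implementation used in the experiments: the function names are hypothetical, the raw network outputs are passed in as plain lists, and "7 complete periods" is interpreted here as the terms $N = -3, \ldots, 3$.

```python
import math


def mixture_parameters(z_alpha, z_mu, z_sigma):
    """Map raw network outputs to (alpha, mu, sigma), one entry per kernel."""
    z_max = max(z_alpha)                      # subtract the max for a stable softmax
    exps = [math.exp(z - z_max) for z in z_alpha]
    total = sum(exps)
    alpha = [e / total for e in exps]         # softmax: non-negative, sums to one
    mu = list(z_mu)                           # centres are the outputs themselves
    sigma = [math.exp(z) for z in z_sigma]    # exponential keeps sigma positive
    return alpha, mu, sigma


def wrapped_normal(theta, mu, sigma, periods=7):
    """Gaussian on the line wrapped onto the circle, truncated to `periods` terms."""
    half = periods // 2
    norm = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    return sum(norm * math.exp(-0.5 * ((theta + 2.0 * math.pi * n - mu) / sigma) ** 2)
               for n in range(-half, half + 1))


def conditional_density(theta, alpha, mu, sigma):
    """Mixture of wrapped normals evaluated at the angle theta."""
    return sum(a * wrapped_normal(theta, m, s) for a, m, s in zip(alpha, mu, sigma))
```

With centres close to 0 and moderate widths, the truncated sum integrates to one over $[0, 2\pi)$ to within numerical precision, which is the sense in which the truncation error is negligible.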

2.2  Mixtures  of  Circular  Normal  Densities 

The second approach is to make the kernels themselves periodic. Consider a velocity
vector $v$ in two-dimensional Euclidean space for which the probability distribution
$p(v)$ is a symmetric Gaussian. By using the transformation $v_x = \|v\| \cos\theta$, $v_y =
\|v\| \sin\theta$, we can show that the conditional distribution of the direction $\theta$, given the
vector magnitude $\|v\|$, is given by

$$\phi(\theta) = \frac{1}{2\pi I_0(m)} \exp\{ m \cos(\theta - \psi) \} \qquad (6)$$

which is known as a circular normal or von Mises distribution [6]. The normalization
coefficient is expressed in terms of the modified Bessel function $I_0(m)$, and the
parameter $m$ (which depends on $\|v\|$) is analogous to the inverse variance parameter
in a conventional normal distribution. The parameter $\psi$ gives the peak of the density
function. Because of the $I_0(m)$ term, care must be taken in the implementation of
the error function to avoid overflow.
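
One possible way of avoiding the overflow is to work with $\log I_0(m)$ and to factor $e^m$ out of the integral representation $I_0(m) = \frac{1}{\pi}\int_0^\pi e^{m \cos t}\, dt$. The Python sketch below does this with a simple midpoint rule; it is an assumption about how the issue might be handled, not the method used in the paper, and the function names and the number of quadrature points are illustrative.

```python
import math


def log_bessel_i0(m, n=2000):
    """log I0(m), using I0(m) = exp(m) * (1/pi) * integral_0^pi exp(m (cos t - 1)) dt."""
    h = math.pi / n
    # Every exponent is <= 0, so the terms lie in [0, 1] and cannot overflow.
    total = sum(math.exp(m * (math.cos((j + 0.5) * h) - 1.0)) for j in range(n))
    return m + math.log(total / n)


def log_circular_normal(theta, psi, m):
    """Log of the circular normal (von Mises) density with peak psi and parameter m."""
    return m * math.cos(theta - psi) - math.log(2.0 * math.pi) - log_bessel_i0(m)
```

The negative log-likelihood can then be accumulated from `log_circular_normal` directly, without ever forming $I_0(m)$ itself for large $m$.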

2.3  Expansion  in  Fixed  Kernels 

The third technique uses a model consisting of a fixed set of periodic kernels, again
given by circular normal functions as in equation (6). In this case the mixing proportions
alone are determined by the outputs of a neural network (through a softmax
activation function) and the centres $\psi_i$ and width parameters $m_i$ are fixed. We
selected a uniform distribution of centres, and $m_i = m$ for each kernel, where the
value for $m$ was chosen to give moderate overlap between the component functions.

3  Application  to  Synthetic  Data 

We first consider some synthetic test data. It is generated from a mixture of two
triangular distributions where the centres and widths are taken to be linear functions
of a single input variable $x$. The mixing coefficients are fixed at 0.6 and 0.4. Any
values of $\theta$ which fall outside $(-\pi, \pi)$ are mapped back into this range by shifting
in multiples of $2\pi$. An example data set generated in this way is shown in Figure 1.
Three independent datasets (for training, validation and testing) were generated,
each containing 1000 data points. Training runs were carried out in which the
number of hidden units and the number of kernels were varied to determine good
values on the validation set. Table 1 gives a summary of the best results obtained,
as determined from the test set. The mixture of adaptive circular normal functions
performed best. Plots of the distribution from the adaptive circular normal model
are shown in Figure 1.

Figure 1  (a) Scatter plot of the synthetic training data. (b) Contours of the conditional
density $p(\theta|x)$ obtained from a mixture of adaptive circular normal functions
as described in Section 2.2. (c) The distribution $p(\theta|x)$ for $x = 0.5$ (solid curve) from
the adaptive circular normal model, compared with the true distribution (dashed
curve) from which the data was generated.
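
A data set of this kind could be generated along the following lines. The sketch is purely illustrative: the linear laws for the centres and half-widths, the range of $x$ and the function names are hypothetical choices, since the text does not give the actual coefficients; only the mixing coefficients 0.6 and 0.4 and the wrapping into the range around zero are taken from the description above.

```python
import math
import random


def wrap(theta):
    """Shift theta by multiples of 2*pi into the range [-pi, pi)."""
    return (theta + math.pi) % (2.0 * math.pi) - math.pi


def sample_theta(x, rng):
    """Draw one angle from a two-component triangular mixture conditioned on x."""
    if rng.random() < 0.6:            # first component, mixing coefficient 0.6
        centre, half = -1.0 + 2.0 * x, 0.5 + 0.5 * x   # hypothetical linear laws
    else:                             # second component, mixing coefficient 0.4
        centre, half = 1.5 - 1.0 * x, 1.0 - 0.5 * x    # hypothetical linear laws
    return wrap(rng.triangular(centre - half, centre + half, centre))


def make_dataset(n, seed):
    """Generate n pairs (x, theta) with x uniform on [0, 1)."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    return [(x, sample_theta(x, rng)) for x in xs]
```

Calling `make_dataset(1000, seed)` with three different seeds would give independent training, validation and test sets of the size used above.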


Method                      Centres   Hidden Units   Validation Error   Test Error
Wrapped normal              6         7              1177.1             1184.4
Adaptive circular normal    6         8              1109.5             1133.9
Fixed kernel                36        7              1184.6             1223.5

Table 1  Results obtained using synthetic data.


4  Application  to  Radar  Scatterometer  Data 

The European Remote Sensing satellite ERS-1 [8, 3] has three C-band radar antennae
which measure the total backscattered power $\sigma_0$ along three directions. When
the satellite passes over the ocean, the strengths of the backscattered signals are
related to the surface ripples of the water which in turn are determined by the
low-level winds. Extraction of the wind vector from the radar signals represents
an inverse problem which is typically multi-valued. Although determining the wind
speed is relatively straightforward, the data gives rise to aliases for the direction.
For example, a wind direction of $\theta$ will sometimes give rise to similar radar signals
to a wind direction of $\theta + \pi$, and there may be further aliases at other angles.
Modelling the conditional distribution of wind direction provides the most complete
information for further processing to 'de-alias' the wind directions.

A large set of ERS-1 measurements has been assembled by the European Space
Agency in collaboration with the UK Meteorological Office. Labelling of the dataset
was performed using wind vectors from the Meteorological Office Numerical Weather
Prediction model. These values were interpolated from the inherently coarse-grained
model to regions coincident with the scatterometer cells. To provide a challenging
task, the data was selected from low pressure (cyclonic) and high pressure
(anticyclonic) circulations. Ten wind fields from each of the two categories were used:
each wind field contains 19 × 19 = 361 cells each of which represents an area of
approximately 26 km × 26 km. Training, validation and test sets each contained 1963
patterns.

The inputs used were the three values of $\sigma_0$ for the aft-beam, mid-beam and
fore-beam, and the sine of the incidence angle of the mid-beam (which strongly influences
the signal received). The $\sigma_0$ inputs were scaled to have zero mean and unit variance.
The target value was expressed as an angle (in radians) clockwise from the satellite's
forward path. Table 2 gives a summary of the preliminary results obtained. This is
a more complex domain than the synthetic problem so there were more difficulties
with local optima. This problem was considerably reduced (to about 25% of the
runs) by starting training with the centres of the kernel functions approximately
uniformly spaced in $[0, 2\pi)$. Of the adaptive-centre models, the one with six centres
has the lowest error on the validation data: however fewer centres are actually
required to model the conditional density function reasonably well.

Method                      Centres   Hidden Units   Validation Error   Test Error
Wrapped normal              4         20             2474.6             2446.2
Adaptive circular normal    6         20             2308.0             2337.9
Fixed kernel                36        24             2028.9             1908.9

Table 2  Results on satellite data.

5  Discussion 

All three methods give reasonable results, with the adaptive-kernel approaches beating
the fixed-kernel technique on synthetic data, and the reverse on the scatterometer
data. The two fully adaptive methods give similar results.

Note that there are two structural parameters to select: the number of hidden units
in the network and the number of components in the mixture model. The use of a
larger, fixed network structure, together with regularization to control the effective
model complexity, would probably simplify the process of model order selection.
Most neural network models take a neuron to be a point processor by neglecting the extensive
spatial structure of the dendritic tree system. When such structure is included, the dynamics of
a neural network can be formulated in terms of a set of coupled nonlinear integro-differential
equations. The kernel of these equations contains all information concerning the effects of the
dendritic tree, and can be calculated explicitly. We describe recent results on the analysis of these
integro-differential equations.


The local diffusive spread of electrical activity along a cylindrical region of a neuron's
dendrites can be described by the cable equation

$$\frac{\partial V}{\partial t} = -\epsilon V + D\,\frac{\partial^2 V}{\partial x^2} + I(x,t) \tag{1}$$

where $x$ is the spatial location along the cable, $V(x,t)$ is the local membrane potential
at time $t$, $\epsilon$ is the decay rate, $D$ is the diffusion coefficient and $I(x,t)$ is any
external input. Note that equation (1) is valid provided that conductance changes
induced by synaptic inputs are small; in the dendritic spines the full Nernst-Planck
equations must be considered [1]. A compartmental model replaces the cable equation
by a system of coupled ordinary differential equations according to a space-discretization
scheme [2]. The complex topology of the dendrites is represented by
a simply-connected graph or tree $\Gamma$. Each node of the tree $\alpha \in \Gamma$ corresponds to a
small region of dendrite (compartment) over which the spatial variation of physical
properties is negligible. Each compartment $\alpha$ can be represented by an equivalent
circuit consisting of a resistor $R_\alpha$ and capacitor $C_\alpha$ in parallel, which is joined to
its nearest neighbours $\langle \beta, \alpha \rangle$ by junctional resistors $R_{\alpha\beta}$. We shall assume for
simplicity that the tree $\Gamma$ is coupled to a single somatic compartment via a junctional
resistor $R$ from node $\alpha_0 \in \Gamma$ (Figure 1). The boundary conditions at the
terminal nodes of the tree can either be open (membrane potential is clamped at
zero) or closed (no current can flow beyond the terminal node).

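An application of Kirchhoff's law yields the result

$$C_\alpha \frac{dV_\alpha}{dt} = -\frac{V_\alpha}{R_\alpha} + \sum_{\langle \beta, \alpha \rangle} \frac{V_\beta - V_\alpha}{R_{\alpha\beta}} + \frac{U - V_{\alpha_0}}{R}\,\delta_{\alpha,\alpha_0} + I_\alpha(t), \tag{2}$$

$$\hat{C} \frac{dU}{dt} = -\frac{U}{\hat{R}} + \frac{V_{\alpha_0} - U}{R} \tag{3}$$

where $V_\alpha(t)$ is the membrane potential of compartment $\alpha \in \Gamma$ and $U(t)$ is the
membrane potential of the soma, whose capacitance and resistance are denoted $\hat{C}$ and $\hat{R}$.
It can be shown that there exists a choice of
parameterisation (where all branches are uniform) for which equation (2) reduces
to the matrix form [3]

$$\frac{d\mathbf{V}}{dt} = 2\sigma \mathbf{Q}\mathbf{V}(t) - (\epsilon + 2\sigma)\mathbf{V}(t) + U(t)\,\mathbf{a} + \mathbf{I}(t) \tag{4}$$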
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Figure  1  Compartmental  model  of  a  neuron. 


where $\sigma$, $\epsilon$ are global longitudinal and transverse decay rates respectively.
The matrix $\mathbf{Q}$ generates an unbiased random walk on the tree $\Gamma$. That
is, $\mathbf{Q} = \mathbf{D}^{-1}\mathbf{A}$ where $\mathbf{A}$ is the adjacency matrix of $\Gamma$, $A_{\alpha\beta} = 1$ if the nodes $\alpha$ and
$\beta$ are adjacent and $A_{\alpha\beta} = 0$ otherwise, and $\mathbf{D} = \mathrm{diag}(d_\alpha)$ where $d_\alpha$ is the coordination
number of node $\alpha$. Our choice of parameterisation is particularly useful
since it allows one to study the effects of dendritic structure using algebraic graph
theory (see below). More general choices of parameterisation where each branch is
nonuniform, say, can be handled using perturbation theory.
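
The construction of the random-walk matrix from the adjacency matrix and the coordination numbers can be sketched in a few lines. This is only an illustration: the function name and the small Y-shaped tree are hypothetical and do not come from the paper. Exact fractions are used so that the defining property of an unbiased walk, that each row sums to one, can be checked without rounding error.

```python
from fractions import Fraction

def random_walk_matrix(adjacency):
    """Return Q = D^{-1} A for an undirected tree given as an adjacency matrix."""
    q = []
    for row in adjacency:
        degree = sum(row)  # coordination number d_alpha of this node
        if degree == 0:
            raise ValueError("isolated node: coordination number is zero")
        q.append([Fraction(entry, degree) for entry in row])
    return q

# Hypothetical Y-shaped tree: node 0 is a branching node joined to nodes 1, 2 and 3.
A = [
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
Q = random_walk_matrix(A)
row_sums = [sum(row) for row in Q]  # every entry equals 1 for an unbiased walk
```

From the branching node the walk moves to each neighbour with probability 1/3, while from a terminal node it returns to the branching node with probability 1.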

Since the output of the neuron is determined by the somatic potential $U(t)$, we can
view the dendritic potentials as auxiliary variables. In particular, using an integrating
factor we can solve equation (4) for $\mathbf{V}(t)$ in terms of $U(t)$ and $\mathbf{I}(t)$. Substitution
into equation (3) then yields the integro-differential equation (assuming without
loss of generality that $\mathbf{V}(0) = 0$ and $RC = 1$)

$$\frac{dU}{dt} = -\rho U(t) + \int_0^t \sum_{\alpha \in \Gamma} G_{\alpha_0\alpha}(t - t')\left[U(t')\,\delta_{\alpha,\alpha_0} + I_\alpha(t')\right] dt' \tag{5}$$

where $\rho = 1/\hat{R}\hat{C} + 1/R\hat{C}$ and

$$G_{\alpha\beta}(t) = e^{-(\epsilon + 2\sigma)t}\left[e^{2\sigma t \mathbf{Q}}\right]_{\alpha\beta} \tag{6}$$

is the dendritic membrane potential response of compartment $\alpha$ at time $t$ due to a
unit impulse stimulation of node $\beta$ at time $t = 0$ in the absence of the term $U(t)\mathbf{a}$
on the right-hand side of equation (4). All details concerning the passive effects
of dendritic structure are contained within $\mathbf{G}(t)$. Note that in deriving equation
(5) we have assumed that the inputs $I_\alpha(t)$ are voltage-independent. Thus we are
ignoring the effects of shunting and voltage-dependent ionic gates.

One way to calculate $G_{\alpha\beta}(t)$, equation (6), would be to diagonalize the matrix $\mathbf{Q}$
to  obtain  (for  a  finite  tree) 

Gapit)  =  (7) 

r 

where $\{\lambda_r\}$ forms the discrete spectrum of $\mathbf{Q}$, with associated eigenvectors.
It can be shown that the spectral radius $\rho(\mathbf{Q}) = 1$ and an application of the Perron-Frobenius
Theorem establishes that (a) $\lambda = 1$ is a nondegenerate eigenvalue and (b)
eigenvalues appear in real pairs $\pm\lambda_r$. However, such a diagonalization procedure is
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rather  cumbersome  for  large  trees  and  does  not  explicitly  yield  the  dependence  on 
tree  topology.  An  alternative  approach  is  to  exploit  the  fact  that  Q  generates  a 
random  walk  on  F.  That  is,  expand  equation  (6)  to  obtain 

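$$G_{\alpha\beta}(t) = e^{-(\epsilon + 2\sigma)t} \sum_{r=0}^{\infty} \frac{(2\sigma t)^r}{r!}\left[\mathbf{Q}^r\right]_{\alpha\beta} \tag{8}$$

where $[\mathbf{Q}^r]_{\alpha\beta}$ is equal to the probability of an unbiased random walk on $\Gamma$ reaching
$\alpha$ from $\beta$ in $r$ steps. Using various reflection arguments, which is equivalent to
a generalized method of images, one can express $G_{\alpha\beta}(t)$ as a series involving the
response function of an infinite, uniform chain of dendritic compartments [3,4],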


$$G_{\alpha\beta}(t) = e^{-(\epsilon + 2\sigma)t} \sum_{\mu}^{(\alpha,\beta)} b_\mu\, I_{M_\mu}(2\sigma t) \tag{9}$$

where $I_n$ is the modified Bessel function of integer order $n$. In equation (9), the
summation is over trips $\mu$ starting at node $\beta$ and ending at node $\alpha$, and a trip is a
restricted kind of walk in which changes of direction can only take place at branching
nodes or terminal nodes of the tree. The length of a trip (number of steps) is given
by $M_\mu$. The constant coefficients $b_\mu$ are determined by various factors picked up
along a trip according to the following “Feynman Rules”. (i) A factor of $+1$ on
reflection from a closed terminal node and a factor of $-1$ on reflection from an
open terminal node. (ii) A factor of $2/d_\alpha$ when a trip passes through a branching
node with coordination number $d_\alpha > 2$. (iii) A factor of $2/d_\alpha - 1$ when a trip reflects
at a branching node of coordination number $d_\alpha$. (iv) A factor of $2/d_\alpha$ when a trip
terminates at a branching node ($d_\alpha > 2$) or a closed terminal node ($d_\alpha = 1$).
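
The link between the random-walk expansion and the Bessel-function series can be checked numerically in the simplest case, an infinite uniform chain, where every node has coordination number 2 and the direct trip is the only one with a nonzero coefficient. The sketch below assumes the response function has the form exp(-(eps + 2 sigma) t) multiplied by the matrix exponential of 2 sigma t Q; the function names and parameter values are illustrative and are not taken from the paper.

```python
import math

def bessel_i(n, x, terms=60):
    """Modified Bessel function I_n(x) of integer order n >= 0, from its power series."""
    return sum((x / 2) ** (2 * k + n) / (math.factorial(k) * math.factorial(k + n))
               for k in range(terms))

def walk_probability(r, n):
    """Probability that an unbiased walk on an infinite chain is n sites away after r steps."""
    if r < n or (r - n) % 2:
        return 0.0
    return math.comb(r, (r + n) // 2) / 2 ** r

def chain_response_series(n, t, eps, sigma, max_steps=120):
    """Truncated random-walk series for the response between compartments n sites apart."""
    total = sum((2 * sigma * t) ** r / math.factorial(r) * walk_probability(r, n)
                for r in range(max_steps))
    return math.exp(-(eps + 2 * sigma) * t) * total

def chain_response_bessel(n, t, eps, sigma):
    """Closed form for the same response: exp(-(eps + 2 sigma) t) I_n(2 sigma t)."""
    return math.exp(-(eps + 2 * sigma) * t) * bessel_i(n, 2 * sigma * t)

# Illustrative (assumed) parameter values.
eps, sigma, t = 0.5, 1.0, 2.0
pairs = [(chain_response_series(n, t, eps, sigma), chain_response_bessel(n, t, eps, sigma))
         for n in range(5)]
```

Each pair in `pairs` holds the two evaluations for one separation n, and they agree to within rounding error. On a tree with branching or terminal nodes the extra trips and their coefficients would have to be added according to the rules above.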

The series (9) will typically involve an infinite number of terms. However, for any
fixed $t$, trips with length $M_\mu \gg \sqrt{t}$ can be neglected so that the sum over $\mu$
can be truncated to include only a finite number of terms. Thus, using an efficient
algorithm for generating trips should provide an efficient method for calculating
$\mathbf{G}(t)$ at small and intermediate times. Moreover, one can also extract information
concerning the long-time behaviour by Laplace-transforming (9)
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where  e  =  e  -}-  2&.  The  resulting  summation  over  trips  can  be  performed  explicitly 
to yield a closed expression for $\tilde{\mathbf{G}}(z)$ [5]. In the remainder of this paper we consider
some examples illustrating the effects of dendritic structure on neurodynamics. We
shall find that the Laplace transform $\tilde{\mathbf{G}}(z)$ plays a crucial role in determining the
stability of these systems.

Figure 2 Schematic diagram of a compartmental model leaky integrator neuron (LIN) indicating
the feedback arising from the electrical coupling between the soma and dendrites.

Figure 3 A network of compartmental model neurons indicating the axo-dendritic connection
from neuron $j$ to compartment $\alpha$ of neuron $i$.

A major issue at the single neuron level is the effect of the coupling between soma
and dendrites on the input-output response of a neuron satisfying equation (5),
which may be rewritten in the form

$$\frac{dU}{dt} = -\rho U(t) + \int_0^t H(t - t')\,U(t')\,dt' + \hat{I}(t), \qquad \hat{I}(t) = \int_0^t \sum_{\alpha \in \Gamma} G_{\alpha_0\alpha}(t - t')\,I_\alpha(t')\,dt' \tag{11}$$

This describes a leaky-integrator neuron with external input $\hat{I}(t)$ and an additional
feedback term mediated by the kernel $H(t) = G_{\alpha_0\alpha_0}(t)$ which takes into account
the underlying coupling between $U(t)$ and the dendritic potentials. The model is
represented schematically in figure 2. Since both $H(t)$ and $\hat{I}(t)$ are continuous on
$[0, \infty)$ we can apply a variation of parameters formula to obtain

$$U(t) = Z(t)U(0) + \int_0^t Z(t - t')\,\hat{I}(t')\,dt', \qquad Z(t) = \mathcal{L}^{-1}\left[\frac{1}{z + \rho - \tilde{H}(z)}\right](t) \tag{12}$$

where $\mathcal{L}^{-1}$ indicates the inverse Laplace transform. An interesting application of
this model is to the case of reset where the neuron's somatic potential is reset to
a zero resting level whenever it reaches a threshold $h$ and fires. Let $T_n$ denote
the $n$th firing-time of the neuron. Then $U(t)$ evolves according to equation (11)
for $T_n < t < T_{n+1}$ and $U(T_n) = h$ so that $U(T_n^+) = 0$. It can be shown that
the firing-times evolve according to a second-order difference equation rather than
a first-order one typical of standard integrate-and-fire models (for more details
see Ref. [6]). Thus the coupling between soma and dendrites can have non-trivial
consequences for dynamical behaviour.

Now consider a network of $N$ neurons labelled $i = 1, \ldots, N$ each with identical
dendritic structure. Let $W_{ij}^{\alpha}$ denote the synaptic weight of the connection from
neuron $j$ impinging on compartment $\alpha \in \Gamma$ of neuron $i$. The associated output is
taken to be of the form $W_{ij}^{\alpha} f(U_j)$ where $f$ is a sigmoidal output function (see figure
3). Equation (5) then leads to a set of coupled integro-differential equations

$$\frac{dU_i}{dt} = -\rho U_i(t) + \int_0^t \sum_{\beta \in \Gamma} G_{\alpha_0\beta}(t - t') \sum_{j=1}^{N} W_{ij}^{\beta} f(U_j(t'))\,dt' \tag{13}$$

Note that for simplicity we are neglecting the feedback arising from the coupling
between the soma and dendrites of the same neuron. Further, let the output function
$f$ satisfy $f(x) = \tanh(\kappa x)$. Then $f(0) = 0$ so that $\mathbf{U} = 0$ is a fixed point solution
of (13). Linearization about the zero solution with $f'(0) = \kappa$ gives

$$\frac{dU_i}{dt} = -\rho U_i(t) + \sum_{j=1}^{N} \int_0^t H_{ij}(t - t')\,U_j(t')\,dt' \tag{14}$$

where

$$H_{ij}(t) = \kappa \sum_{\beta \in \Gamma} W_{ij}^{\beta}\, G_{\alpha_0\beta}(t) \tag{15}$$

Assume  that  the  weights  satisfy  the  condition  ^  ^  for  all  i,j  = 

so that the convolution kernel $\mathbf{H}(t)$ is a continuous $N \times N$ matrix whose
elements lie in $L^1[0, \infty)$. That is, $\int_0^\infty |H_{ij}(t)|\,dt < \infty$ for all $i,j = 1, \ldots, N$. It can
then be proven [7] that the zero solution of equation (14) is asymptotically stable
if and only if

$$\Delta(z) = \det\left[(\rho + z)\mathbf{I} - \tilde{\mathbf{H}}(z)\right] \neq 0 \quad \text{when } \mathrm{Re}\,z \geq 0 \tag{16}$$

where $\tilde{\mathbf{H}}(z)$ is the Laplace transform of the kernel $\mathbf{H}$, which can be calculated
explicitly from equations (15) and (10), and $\mathbf{I}$ is the $N \times N$ unit matrix. Condition
(16) requires that no roots of the so-called characteristic function should lie in the
right-half complex plane.

It  can  also  be  established  that  if  the  stability  condition  (16)  is  met  for  the  linear 
system  (14),  then  the  zero  solution  of  the  full  nonlinear  system  (13)  is  locally 
asymptotically stable [7]. On the other hand, if $\Delta(z)$ has at least one root in the
right-half  complex  plane  then  the  linear  system  (14)  is  unstable.  The  corresponding 
instability  of  the  full  nonlinear  system  is  harder  to  analyse;  in  such  cases  one  has 
to  bring  in  techniques  such  as  bifurcation  theory  to  investigate  the  effects  of  the 
nonlinearities  on  the  behaviour  of  the  system  at  the  point  of  marginal  stability 
where  one  or  more  eigenvalues  cross  the  imaginary  axis  [8]. 

It is clear from equation (15) that the stability of a recurrent network depends
on the particular distribution of axo-dendritic connections across the network. We
end this paper by presenting some recent results concerning such stability. Detailed
proofs are presented elsewhere [8,9]. First, suppose that the weights decompose
into the product form $W_{ij}^{\alpha} = J_{ij} P_\alpha$, $P_\alpha \geq 0$, and $\sum_\alpha P_\alpha = 1$. Thus the distribution
of axon collaterals across the dendritic tree of a neuron is uncorrelated with the
location of the presynaptic and postsynaptic neurons in the network. Assuming
that the weight matrix $\mathbf{J}$ can be diagonalized with eigenvalues $w_r$, $r = 1, \ldots, N$, then
equation (16) reduces to

$$(z + \rho) - \kappa w_r \sum_{\beta \in \Gamma} P_\beta\, \tilde{G}_{\alpha_0\beta}(z) = 0 \tag{17}$$

Two particular results follow from equation (17).

I. The zero solution is (locally) asymptotically stable if

$$\kappa |w_r| < \rho \left[\sum_{\beta \in \Gamma} P_\beta\, \tilde{G}_{\alpha_0\beta}(0)\right]^{-1}$$

for all $r = 1, \ldots, N$. Taking $P_\beta = \delta_{\beta,\beta'}$, for example, shows that the passive membrane
properties of the dendritic tree can stabilize an equilibrium. This follows from the
property that for compartments sufficiently far from the soma, $\tilde{G}_{\alpha_0\beta}(0) < \tilde{G}_{\alpha_0\alpha_0}(0)$.

II. The passive membrane properties of the dendrites can induce oscillations via a
supercritical Hopf bifurcation [8]. At the point of marginal stability there exists a
pair of pure imaginary roots $z = \pm i\omega$ satisfying equation (17) for a nondegenerate
eigenvalue $w^* = \min_r\{w_r\}$ such that $w^* < 0$ and all roots of equation (17) for
$w_r > w^*$ have negative definite real part.
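
Result I can be made concrete for the simplest geometry. The sketch below rests on assumptions that are not part of the paper's general setting: the dendrite is an infinite uniform chain, all synapses sit on a single compartment n steps from the one coupled to the soma, and the response function is exp(-(eps + 2 sigma) t) I_n(2 sigma t). Its Laplace transform at z = 0 then follows from the standard transform of the modified Bessel function. Function names and parameter values are hypothetical.

```python
import math

def chain_dc_response(n, eps, sigma):
    """Laplace transform at z = 0 of exp(-(eps + 2 sigma) t) I_n(2 sigma t)."""
    eps_hat = eps + 2 * sigma
    root = math.sqrt(eps_hat ** 2 - 4 * sigma ** 2)
    lam = (eps_hat - root) / (2 * sigma)  # attenuation per compartment, 0 < lam < 1
    return lam ** abs(n) / root

def critical_gain(n, rho, eps, sigma):
    """Bound on kappa * |w_r| from result I when every synapse is n compartments away."""
    return rho / chain_dc_response(n, eps, sigma)

# Illustrative (assumed) parameter values.
rho, eps, sigma = 1.0, 0.5, 1.0
gains = [critical_gain(n, rho, eps, sigma) for n in range(6)]
```

With these values the attenuation per compartment is 1/2, so the admissible gain doubles with each compartment of distance from the soma. This is the sense in which placing synapses distally can stabilize the equilibrium.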
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When  one  considers  the  synaptic  organisation  of  the  brain,  it  becomes  apparent 
that  the  decoupling  of  network  and  dendritic  coordinates  is  an  oversimplification. 
One  example  of  this  concerns  the  recurrent  collaterals  of  pyramidal  cells  in  the 
olfactory cortex [10]. These collaterals feed back excitation onto the basal dendrites
of nearby pyramidal cells and onto the apical dendrites of distant pyramidal cells.
This suggests that in our model the distance of a synapse from the soma should
increase with the separation $|i - j|$; this leads to a reduction in the effectiveness
of the connection (ignoring active processes). On the other hand, there is growing
experimental evidence that the reduction in the effectiveness of distal synapses may
be compensated by a number of mechanisms including increases in the density of
synapses at distal locations and voltage-dependent gates on dendritic spines [10]. We
can incorporate the former possibility by taking the weights to be an increasing function of
the degree of separation of compartment $\alpha$ from the soma at $\alpha_0$. If this is combined
with the previous feature then we have a third result.

III.  The  passive  membrane  properties  of  the  dendritic  tree  can  induce  spatial  pat¬ 
tern  formation  (arising  from  a  Turing-like  instability)  in  a  purely  excitatory  or 
inhibitory  network  when  the  dependence  of  the  weight  distribution  on  dendritic 
and  network  coordinates  is  correlated.  Thus  one  does  not  require  a  competition 
between  excitatory  and  inhibitory  connections  (Mexican  hat  function)  to  induce 
the  formation  of  coherent  spatial  structures. 

Finally, note that all the results presented in this paper have a well-defined continuum
limit. In particular, introduce a length-scale $l$ corresponding to the length
of an individual compartment. One then finds that in the limit $l \to 0$ such that
$\sigma l^2 \to D$, equation (4) reduces to the cable equation (1) on each branch of the tree
such that current $\partial V/\partial x$ is conserved at each branching node and $V$ is continuous
at  a  branching  node.  Equation  (10)  with  a\  =  becomes 


1

The statistical principles underpinning hidden-layer feed-forward neural networks for fitting smooth
curves to regression data are explained and used as a basis for developing likelihood- and
bootstrap-based methods for obtaining confidence intervals for predicted outputs.
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1  Introduction 

Hidden-layer feed-forward neural networks are used extensively to fit curves to regression
data and to provide surfaces from which classification rules can be deduced.
The focus of this article is on curve-fitting applications and two crucial statistical
insights into the workings of neural networks in this context are presented in Section
2. Approaches to developing confidence limits for predicted outputs are explored in
Section 3 and some conclusions given in Section 4.

2  Statistical  Insights 

Consider the following single hidden-layer feed-forward neural network used to
model regression data of the form, $(x_i, y_i)$, $i = 1, \ldots, n$. The input layer comprises a
neuron which inputs the $x$-variable and a second neuron which inputs a constant or
bias term into the network. The hidden-layer comprises two neurons with logistic
activation functions and an additional neuron which inputs a bias term and the
output layer provides the predicted $y$-value through a neuron with a linear activation
function. The connection weights, $\theta = (\theta_1, \ldots, \theta_7)$, are defined in such a way
that the output, $o$, from the network corresponding to an input, $x$, can be written
explicitly as

$$o = \theta_1 + \frac{\theta_2}{1 + e^{-(\theta_3 + \theta_4 x)}} + \frac{\theta_5}{1 + e^{-(\theta_6 + \theta_7 x)}}. \tag{1}$$

If in addition the network is trained to minimize the error sum-of-squares,
$\sum_{i=1}^{n} (y_i - o_i)^2$, then it is clear that implementing this neural network is equivalent to using
the method of least squares to fit the nonlinear regression model, $y_i = o_i + \epsilon_i$, $i = 1, \ldots, n$,
where the error terms, $\epsilon_i$, are independently and identically distributed
with zero mean and constant variance, to the data. The weights of the network correspond
to parameters in the regression model, training corresponds to iteration in
an appropriate optimization algorithm, and generalization to prediction. In fact the
nonlinear regression model just described is not in any way meaningful in relation
to the data and the broad modelling procedure should rather be viewed as one of
smoothing. In the present example, two logistic functions are scaled and located so
that their sum, together with a constant term, approximates an appropriate smooth
curve. Overall therefore it is clear that the underlying mechanism of a hidden-layer
feed-forward neural network is that of nonlinear regression and that the broad
principle underpinning such a network is that of nonparametric regression.
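
The equivalence between training the network and nonlinear least squares can be made concrete with a short sketch. It writes the network output as a constant plus two scaled and shifted logistic curves, as described above. The ordering of the seven weights inside the tuple is an assumption made for this illustration, and the weight values are those quoted later for Example 1. The function names are hypothetical.

```python
import math

def logistic(u):
    """Logistic activation 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + math.exp(-u))

def network_output(x, theta):
    """Output of the network: a constant plus two scaled, shifted logistic curves.

    The ordering of the seven weights is an assumption made for this sketch.
    """
    t1, t2, t3, t4, t5, t6, t7 = theta
    return t1 + t2 * logistic(t3 + t4 * x) + t5 * logistic(t6 + t7 * x)

def error_sum_of_squares(theta, xs, ys):
    """Training criterion S(theta) = sum_i (y_i - o_i)^2."""
    return sum((y - network_output(x, theta)) ** 2 for x, y in zip(xs, ys))

# Noise-free data generated from the network itself give S(theta) = 0 at the true weights.
theta_true = (0.5, 0.5, 1.0, -1.0, 0.1, 1.0, 1.5)
xs = [float(x) for x in range(-12, 13)]  # 25 equally spaced design points
ys = [network_output(x, theta_true) for x in xs]
```

Any optimizer that reduces `error_sum_of_squares` over the seven weights is performing nonlinear least squares, whatever the training algorithm is called.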

3  Confidence  Limits  for  the  True  Predicted  Values 

Consider  a  hidden-layer  feed-forward  neural  network  represented  formally  as  the 
nonlinear  regression  model, 

$$y_i = \eta(x_i, \theta) + \epsilon_i, \quad i = 1, \ldots, n, \tag{2}$$

where $y_i$ is the observed value at $x_i$, $\theta = (\theta_1, \ldots, \theta_p)$ is a $p \times 1$ vector of unknown
parameters, $\eta(\cdot)$ is a nonlinear function describing the network, and the error terms,
$\epsilon_i$, are uncorrelated with mean, 0, and constant variance, $\sigma^2$. Then the least squares
estimator of $\theta$, denoted $\hat{\theta}$, is that value of $\theta$ which minimizes the error sum-of-squares,
$S(\theta) = \sum_{i=1}^{n} [y_i - \eta(x_i, \theta)]^2$, and the estimator of $\sigma^2$, denoted $s^2$, is
given by $S(\hat{\theta})/(n - p)$. The mean predicted value at $x_g$ for model (2), and hence
for the underlying neural network, is given by $\eta(x_g, \hat{\theta})$ and confidence intervals for
the corresponding true value, $\eta(x_g, \theta)$, can be derived from likelihood theory or by
resampling methods.

3.1  Likelihood  Theory 

Linearization method : Suppose that the errors in model (2) are normally distributed
and suppose further that this model can be satisfactorily approximated
by a linear one, and that $\hat{V}(\hat{\theta})$, the estimated asymptotic variance of the least
squares estimator, $\hat{\theta}$, is taken to be $s^2 \left[\sum_{i=1}^{n} g(x_i, \theta)\, g(x_i, \theta)^T\right]^{-1}_{\hat{\theta}}$, where
$g(x_i, \theta) = \partial \eta(x_i, \theta)/\partial \theta$, $i = 1, \ldots, n$, and the subscript $\hat{\theta}$ denotes evaluation at that point.
Then the standard error of the mean predicted value at $x_g$ is given by
$\mathrm{SE}[\eta(x_g, \hat{\theta})] = \sqrt{\left[g(x_g, \theta)^T\, \hat{V}(\hat{\theta})\, g(x_g, \theta)\right]_{\hat{\theta}}}$, where $g(x_g, \theta) = \partial \eta(x_g, \theta)/\partial \theta$, and an approximate
$100(1 - \alpha)\%$ confidence interval for $\eta(x_g, \theta)$ can be expressed quite simply as

$$\eta(x_g, \hat{\theta}) \pm t^*\, \mathrm{SE}[\eta(x_g, \hat{\theta})] \tag{3}$$

where $t^*$ denotes the appropriate critical $t$-value with $n - p$ degrees of freedom.

Example 1 : Data were generated from model (2) with deterministic component
(1) corresponding to the neural network described in Section 2 and with normally
distributed error terms. The parameter values were taken to be

$\theta = (0.5, 0.5, 1, -1, 0.1, 1, 1.5)$

and $\sigma = 0.01$, 25 $x$-values, equally spaced between $-12$ and 12, were selected, and
simulated $y$-values, corresponding to these $x$-values, were obtained. The approximate
95% confidence limits to $\eta(x_g, \theta)$ for $x_g \in [-12, 12]$, were calculated using
formula (3) and are summarized in the plots of $\pm t^*\, \mathrm{SE}[\eta(x_g, \hat{\theta})]$ against $x_g$ shown
in Figure 1. The interesting, systematic pattern exhibited by these limits depends
on the design points, $x_i$, $i = 1, \ldots, n$.
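
The linearization calculation amounts to accumulating the outer products of the gradient vectors at the design points, solving one linear system and taking a square root. The sketch below is an illustration under stated assumptions rather than the authors' code: the gradient function, the estimate and the variance estimate are supplied by the caller, the critical t-value is passed in because the standard library has no t quantile function, and all names are hypothetical. It is checked on a straight-line model, for which the exact standard error is known.

```python
import math

def solve_linear_system(matrix, rhs):
    """Solve matrix @ u = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-14:
            raise ValueError("information matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    u = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * u[j] for j in range(i + 1, n))
        u[i] = (aug[i][n] - tail) / aug[i][i]
    return u

def prediction_standard_error(gradient, theta_hat, design_points, x_g, s2):
    """SE of eta(x_g, theta_hat) = sqrt(g' V g) with V = s2 * (sum_i g_i g_i')^{-1}."""
    p = len(theta_hat)
    info = [[0.0] * p for _ in range(p)]
    for x in design_points:
        g = gradient(x, theta_hat)
        for a in range(p):
            for b in range(p):
                info[a][b] += g[a] * g[b]
    g0 = gradient(x_g, theta_hat)
    u = solve_linear_system(info, g0)  # u = (sum_i g_i g_i')^{-1} g(x_g)
    return math.sqrt(s2 * sum(a * b for a, b in zip(g0, u)))

def confidence_limits(prediction, se, t_crit):
    """Interval (3): prediction -/+ t_crit * SE; t_crit must be supplied by the caller."""
    return prediction - t_crit * se, prediction + t_crit * se

def line_gradient(x, theta):
    """Gradient of the straight line eta = theta_1 + theta_2 * x with respect to theta."""
    return [1.0, x]

# Check on a straight line, where SE = s * sqrt(1/n + (x_g - mean)^2 / Sxx) exactly.
xs = [float(x) for x in range(-12, 13)]
se = prediction_standard_error(line_gradient, [0.0, 1.0], xs, 3.0, s2=0.01 ** 2)
```

For the network of Section 2 the same routine applies once `gradient` returns the seven partial derivatives of the output with respect to the weights, evaluated at the least squares estimate.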

Profile likelihood method : Suppose again that the errors in model (2) are normally
distributed and let $S(\hat{\theta}|\eta_g)$ denote the minimum error sum-of-squares for a fixed
value, $\eta_g$, of the true predicted value, $\eta(x_g, \theta)$. Then the profile log-likelihood for
$\eta(x_g, \theta)$ is described, up to an additive constant, by the curve with generic point,
and a likelihood-based $100(1 - \alpha)\%$ confidence interval for $\eta(x_g, \theta)$
comprises those values of $\eta_g$ for which

(4) 

For Example 1 the requisite values of the conditional sum-of-squares, $S(\hat{\theta}|\eta_g)$, for a
particular $x$-value, $x_g$, were obtained by reformulating the deterministic component
of the model as $\eta(x, \theta) = \eta_g + h(x, \theta) - h(x_g, \theta)$, with $h(x, \theta)$ denoting the sum of the
two scaled logistic terms in (1), and by minimizing the resultant error sum-of-squares for appropriate
fixed values of the parameter, $\eta_g = \eta(x_g, \theta)$. For each value of $x_g \in [-12, 12]$
so considered, the profile log-likelihood for the true predicted value, $\eta(x_g, \theta)$, was
approximately quadratic and the attendant 95% confidence limits were calculated as
the two values of $\eta_g$ satisfying the equality condition in expression (4) by using the
bisection method. A plot of these 95% confidence limits, centered at the maximum
likelihood estimator, $\eta(x_g, \hat{\theta})$, against $x_g$, is shown, together with those found by
the linearization method, in Figure 1. It is clear from this figure that the confidence
limits for $\eta(x_g, \theta)$ obtained from these two methods are very similar.

Figure 1 Linearization (solid line) and profile likelihood (dashed line) methods.

Figure 2 Bootstrap residuals (solid line) and linearization (dashed line) methods.

Figure 3 Bootstrap pairs (solid line) and linearization (dotted line) methods;
fitted curve (dashed line) and data points (circles).

3.2 Bootstrap Methods

Bootstrapping residuals : Suppose that the least squares estimate, $\hat{\theta}$, of the parameters
in model (2) is available. Then confidence intervals for true predicted
values,  r}{xg,0),  can  be  obtained  by  bootstrapping  the  standardized  residuals,  e*  = 

[Vi  —  for  2  =  1, , . . ,  n,  following  the  procedure  for  regression  models 

outlined in [1]. For Example 1, the resultant 95% bootstrap confidence limits for
predicted values, $\eta(x_g, \theta)$, with $x_g \in [-12, 12]$ were centered at the corresponding
bootstrap means, and a plot of these limits against $x_g$ is shown in Figure 2, together
with the corresponding limits obtained from the linearization method. It is
clear from this figure that the broad pattern exhibited by the two sets of confidence
limits is the same but that the limits are systematically displaced. An attempt to
correct the bootstrap percentiles for bias by implementing the BCa method of [1]
produced limits which were very different from the uncorrected ones.

Bootstrapping pairs : An alternative to bootstrapping the residuals is to bootstrap
the data pairs, $(x_i, y_i)$, $i = 1, \ldots, n$, directly, following the procedure given in [1].
Approximate 95% confidence intervals for the true predicted values of Example 1
were obtained in this way and the results, in the form of plots of the confidence limits
centered about the bootstrap mean against $x$, are shown in Figure 3, together with
a plot of the corresponding centered limits obtained from the linearization method.
Clearly the confidence limits obtained by bootstrapping the data pairs are wildly
different from those found by the other methods investigated in this study. The
reason for this is clear from the suitably scaled plots of the data points and the
fitted curve which are superimposed on the plots of the confidence limits in Figure
3 so that the $x$-values coincide. In particular, only 4 of the 25 observations are taken
at $x$-values corresponding to the steep slope of the fitted curve between $x = -1$ and
$x = 2.5$ and the probability that at least one of these points is excluded from the
bootstrap sample is high, viz. 0.8463. As a consequence the bootstrap least squares
fitted curve is expected to be, and indeed is, extremely unstable in the region of
this slope.
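
The exclusion probability quoted above can be reproduced by inclusion-exclusion, on the assumption that a bootstrap sample consists of 25 independent draws with replacement from the 25 data pairs. The function name is hypothetical.

```python
from math import comb

def prob_any_excluded(n, k):
    """P(at least one of k given points is absent from a bootstrap sample of size n)."""
    # Inclusion-exclusion for P(all k present); each draw is uniform over the n points.
    all_present = sum((-1) ** j * comb(k, j) * ((n - j) / n) ** n for j in range(k + 1))
    return 1.0 - all_present

p = prob_any_excluded(25, 4)
```

With n = 25 and k = 4 the result is approximately 0.846, in agreement with the value given in the text. In other words, only about 15% of bootstrap samples contain all four observations on the steep part of the curve.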

3.3  Comparison  of  Methods 

It  is  clearly  important  to  generalize  the  results  found  thus  far.  To  this  end,  400  data 
sets  were  simulated  from  the  model  setting  of  Example  1  and,  from  these,  coverage 
probabilities  with  a  nominal  level  of  95%  for  a  representative  set  of  true  predicted 
values  were  evaluated  using  the  likelihood-based  and  the  bootstrap  methods  of 
the  previous  two  subsections.  The  results  are  summarized  in  Table  1  and  clearly 
reinforce  those  obtained  for  Example  1.  In  particular,  the  coverage  probabilities 
provided  by  the  linearization  and  the  profile  likelihood-based  methods  are  close  to 
nominal,  while  those  obtained  by  bootstrapping  the  residuals  are  consistently  low 
over the range of $x$-values considered and those obtained by bootstrapping the data
pairs  are  very  erratic. 
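
When reading coverage estimates based on 400 simulated data sets it helps to keep the Monte Carlo error in mind. The following sketch is not part of the study. It assumes independent simulations and a true coverage equal to the nominal level, and computes the binomial standard error of an estimated coverage probability. The names are hypothetical.

```python
import math

def coverage_standard_error(nominal, n_sim):
    """Binomial standard error of an estimated coverage probability."""
    return math.sqrt(nominal * (1.0 - nominal) / n_sim)

se_percent = 100 * coverage_standard_error(0.95, 400)
band = (95 - 2 * se_percent, 95 + 2 * se_percent)  # two standard errors either side
```

The standard error is about 1.1 percentage points, so estimates between roughly 92.8 and 97.2 lie within two standard errors of the nominal 95%.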

4  Conclusions 

The  aim  of  this  present  study  has  been  to  critically  compare  selected  methods  for 
setting  confidence  limits  to  the  predicted  outputs  from  a  neural  network.  Both  of 
the  likelihood-based  methods  investigated  produced  surprisingly  good  results.  In 
particular, the linearization method proved quick and easy to use, while the profile
likelihood approach, which is more rigorous, was a little more difficult to implement.
In contrast, the bootstrap methods for finding the required confidence limits were,
necessarily, highly computer-intensive, and the results disappointing. On balance,
it would seem that, to quote Wu [2], “The linearization method is a winner”.

                       x-values for true predicted outputs
Method                 -10      -5       -2.5     0        2.5      5        10
Linearization          95       95.75    95.75    94.75    92.25    92.5     94.75
Profile likelihood     95.5     94.5     93.5     96       95.25    95       93.25
Bootstrap residuals    91.75    93.5     94       91.5     90       91.75    94.5
Bootstrap pairs        89.75    90.75    96.5     97.25    98.5     92.5     91

Table 1 Coverage probabilities for a nominal level of 95% and 400 simulations.

We present and analyze a Self Organizing Feature Map (SOFM) for the NP-complete problem of
the travelling salesman (TSP): finding the shortest closed path joining N cities. Since the SOFM
has discrete input patterns (the cities of the TSP) one can examine its dynamics analytically. We
show that, with a particular choice of the distance function for the net, the energy associated to
the SOFM has its absolute minimum at the shortest TSP path. Numerical simulations confirm
that this distance improves performance. It is curious that the distance function having this
property combines the distances of the neuron and of the weight spaces.

1  Introduction 

Solving  difficult  problems  is  a  natural  arena  for  a  would-be  new  calculus  paradigm 
like  that  of  neural  networks.  One  can  delineate  a  sharper  image  of  their  potential 
with  respect  to  the  blurred  image  obtained  in  simpler  problems. 

Here  we  tackle  the  Travelling  Salesman  Problem  (TSP,  see  [Lawler  1985],  [Johnson 
1990])  with  a  Self  Organizing  Feature  Map  (SOFM).  This  approach,  proposed  by 
[Angeniol  1988]  and  [Favata  1991],  started  to  produce  respectable  performances 
with the elimination of the non-injective outputs produced by the SOFM [Budinich
1995].  In  this  paper  we  further  improve  its  performances  by  choosing  a  suitable 
distance  function  for  the  SOFM. 

An  interesting  feature  is  that  this  net  is  open  to  analytical  inspection  down  to  a 
level  that  is  not  usually  reachable  [Ritter  1992].  This  happens  because  the  input 
patterns  of  the  SOFM,  namely  the  cities  of  the  TSP,  are  discrete.  As  a  consequence 
we  can  show  that  the  energy  function,  associated  with  SOFM  learning,  has  its 
absolute  minimum  in  correspondence  to  the  shortest  TSP  path. 

In  what  follows  we  start  with  a  brief  presentation  of  the  working  principles  of  this 
net  and  of  its  basic  theoretical  analysis  (section  2).  In  section  3  we  propose  a  new 
distance  function  for  the  network  and  show  its  theoretical  advantages  while  section 
4  contains  numerical  results.  The  appendix  contains  the  detailed  description  of 
parameters  needed  to  reproduce  these  results. 

2  Solving  the  TSP  with  self-organizing  maps 

The basic idea comes from the observation that in one dimension the exact solution
to the TSP is trivial: always travel to the nearest unvisited city. Consequently, if
we suppose we have a smart map of the TSP cities onto a set of cities distributed
on a circle, we will easily find the shortest tour for these “image cities”, which will
also give a path for the original cities. It is reasonable to conjecture that the better
the distance relations are preserved, the better will be the approximate solution
found. In this way, the original TSP is reduced to a search for a good neighborhood-preserving
map: here we build it via unsupervised learning of a SOFM.

The TSP we consider consists of N cities randomly distributed in the plane
(actually in the (0,1) square). The net is formed by N neurons logically organized
in a ring. The cities are the input patterns of the network and the (0,1) square its
input space.

Figure 1 Schematic net: not all connections from input neurons are drawn.

Figure 2 Weights modification in a learning step: neurons (small gray circles)
are moved towards the pattern $q$ (black circle) by an amount given by (1). The
solid line represents the neuron ring. The shape of the deformation of the ring is
given by the relative magnitude of the $\Delta w$, which is in turn given by the
distance function $h_{rs}$.

Each neuron receives the $(x, y) = q$ coordinates of the cities and thus has two
weights: $(w_x, w_y) = w$. In this view both patterns and neurons can be thought of as
points in two-dimensional space. In response to input $q$, the $r$-th neuron produces
output $o_r = q \cdot w_r$. Figure 1 gives a schematic view of the net while figure 2
represents both patterns and neurons as points in the plane.

Learning follows the standard algorithm [Kohonen 1984]: a city $q_i$ is selected at
random and proposed to the net; let $s$ be the most responding neuron (i.e. the
neuron nearest to $q_i$); then all neuron weights are updated with the rule:

$$\Delta w_r = \epsilon \, h_{rs} \, (q_i - w_r) \qquad (1)$$

where $0 < \epsilon < 1$ is the learning constant and $h_{rs}$ is the distance function.

This  function  determines  the  local  deformations  along  the  chain  and  controls  the 
number  of  neurons  affected  by  the  adaptation  step  (1);  thus  it  is  crucial  for  the 
evolution  of  the  network  and  for  the  whole  learning  process  (see  figure  2). 

Step (1) is repeated several times while $\epsilon$ and the width of the distance function
are reduced at the same time. A common choice for $h_{rs}$ is a Gaussian-like
function of $d_{rs}$ with width $\sigma$, where $d_{rs}$ is the distance between neurons $r$ and $s$
(the number of steps between $r$ and $s$) and $\sigma$ is a parameter which determines the
number of neurons $r$ such that $\Delta w_r \neq 0$; during learning $\epsilon, \sigma \to 0$ so that
$\Delta w_r \to 0$ and $h_{rs} \to \delta_{rs}$.
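The learning step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names (`ring_distance`, `nearest_neuron`, `learning_step`) are hypothetical, and the specific Gaussian form $\exp(-(d_{rs}/\sigma)^2)$ is one possible Gaussian-like choice, not necessarily the one used for the reported results.

```python
import math

def ring_distance(r, s, n):
    """Number of steps between neurons r and s on a ring of n neurons."""
    d = abs(r - s)
    return min(d, n - d)

def nearest_neuron(weights, city):
    """Index of the neuron whose weight vector is closest to the city."""
    return min(
        range(len(weights)),
        key=lambda r: (weights[r][0] - city[0]) ** 2 + (weights[r][1] - city[1]) ** 2,
    )

def learning_step(weights, city, eps, sigma):
    """Apply update rule (1) in place for one presented city; return the winner s."""
    n = len(weights)
    s = nearest_neuron(weights, city)
    for r in range(n):
        # Assumed Gaussian-like distance function of the ring distance d_rs.
        h = math.exp(-(ring_distance(r, s, n) / sigma) ** 2)
        weights[r][0] += eps * h * (city[0] - weights[r][0])
        weights[r][1] += eps * h * (city[1] - weights[r][1])
    return s
```

With a very small `sigma` only the winning neuron moves, which matches the limit $h_{rs} \to \delta_{rs}$ stated in the text.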

After learning, the network maps the two-dimensional input space onto the
one-dimensional space given by the ring of neurons, and neighboring cities are mapped
onto neighboring neurons. For each city its image is given by the nearest neuron.
From the tour on the neuron ring one obtains the path for the original TSP¹.
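Reading the tour off the trained ring can be sketched as follows. The function name `tour_from_ring` is hypothetical, and the tie handling for cities that share a neuron is an arbitrary choice made here for illustration only.

```python
def tour_from_ring(weights, cities):
    """Return city indices ordered by the ring index of their nearest neuron."""
    def winner(city):
        return min(
            range(len(weights)),
            key=lambda r: (weights[r][0] - city[0]) ** 2 + (weights[r][1] - city[1]) ** 2,
        )

    images = [winner(c) for c in cities]
    # sorted() is stable, so cities mapped to the same neuron (a non-injective
    # map) simply keep their input order in this sketch.
    return sorted(range(len(cities)), key=lambda i: images[i])
```

For example, with four neurons at the corners of the unit square and one city near each corner, the returned order visits the cities in the order of the corners along the ring.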

The standard theoretical approach to these nets considers the expectation value
$E[\Delta w_r \mid w_r]$ [Ritter 1992]. In general $E[\Delta w_r \mid w_r]$ cannot be treated
analytically except when the input patterns have a discrete probability distribution,
as happens for the TSP. In this case, $E[\Delta w_r \mid w_r]$ can be expressed as the gradient
of an energy function, i.e. $E[\Delta w_r \mid w_r] = -\epsilon \nabla_{w_r} V(W)$, with

$$V(W) = \frac{1}{2N} \sum_{r,s} h_{rs} \sum_{q_i \in F_s} (q_i - w_r)^2 \qquad (2)$$

where $W = \{w_r\}$ and the second sum is over the set of all the cities $q_i$ having $w_s$ as
nearest neuron, i.e. the cities contained in $F_s$, the Voronoi cell of the neuron having
weight $w_s$. On average, $V(W)$ decreases during learning since
$E[\Delta V \mid W] = -\epsilon \sum_r \| \nabla_{w_r} V \|^2$.
Substantially, in this case, there exists an energy function which describes the
dynamics of the SOFM and which is minimized during learning; formally, the learning
process is the descent along the gradient of $V(W)$. Unfortunately $V(W)$ escapes
analytical treatment until the end of the learning process, when some simplifications
are applicable. Since at the end of learning $h_{rs} \to \delta_{rs}$ we can suppose $h_{rs}$ is
significantly different from zero only for $r = s, s \pm 1$; in this case (2) becomes

$$V(W) \approx \frac{1}{2N} \sum_{s} \sum_{q_i \in F_s} \left[ h_{ss} (q_i - w_s)^2 + h_{s-1,s} (q_i - w_{s-1})^2 + h_{s+1,s} (q_i - w_{s+1})^2 \right] \qquad (3)$$

In addition, simulations support that, at the end of learning, most neurons are
selected by just one city to which they get nearer and nearer. This means that $F_s$
contains just one city, let's call it $q_{i(s)}$, and that $w_s \approx q_{i(s)}$; consequently

$$V(W) \approx \frac{1}{2N} \sum_{s} \left[ h_{s-1,s} (q_{i(s)} - q_{i(s-1)})^2 + h_{s+1,s} (q_{i(s)} - q_{i(s+1)})^2 \right] \qquad (4)$$

and assuming $h_{rs}$ symmetric, i.e. $h_{s,s \pm 1} = h_{s \pm 1,s} = h$, we get

$$V(W) \approx \frac{h}{N} \sum_{s} (q_{i(s)} - q_{i(s+1)})^2 = \frac{h}{N} L_{TSP^2}$$

where $L_{TSP^2}$ is the length of the tour of the TSP considering the squares of the distances
between cities. Thus the Kohonen algorithm for the TSP minimizes an energy function
which, at the end of the learning process, is proportional to the sum of the squares
of the distances. Numerical simulations confirm this result.
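To make the distinction between the ordinary tour length and the sum of squared distances concrete, both quantities can be computed for a closed tour. This is a minimal sketch; the function name `tour_lengths` and the data layout (a list of coordinate pairs plus a visiting order) are assumptions made for illustration.

```python
import math

def tour_lengths(cities, order):
    """Return (L_TSP, L_TSP2) for the closed tour visiting cities in the given order."""
    length = 0.0      # sum of Euclidean distances
    length_sq = 0.0   # sum of squared Euclidean distances
    n = len(order)
    for k in range(n):
        x0, y0 = cities[order[k]]
        x1, y1 = cities[order[(k + 1) % n]]  # wrap around to close the tour
        d2 = (x1 - x0) ** 2 + (y1 - y0) ** 2
        length += math.sqrt(d2)
        length_sq += d2
    return length, length_sq
```

On a 2 by 1 rectangle visited corner by corner the two measures already differ (6 versus 10), since the squared measure penalizes long edges more heavily than short ones.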

3  A  New  Distance  Function 

Our hypothesis is that we can obtain better results for the TSP using a distance
function $h_{rs}$ such that, at the end of the process, $V(W)$ is proportional to the simple
length of the tour $L_{TSP}$, namely $V(W) \propto \sum_s |q_{i(s)} - q_{i(s-1)}| = L_{TSP}$ since, in
general, minimizing $L_{TSP^2}$ is not equivalent to minimizing $L_{TSP}$. We thus consider

¹ The main weakness of this algorithm is that, in about half of the cases, the map produced is
not injective. The definition of a continuous coordinate along the neuron ring solves this problem,
yielding a competitive algorithm [Budinich 1995].
Figure 3 Distances between neurons $s = 4$ and $r = 1$: $D_{1,4} = D_1 + D_2 + D_3$
and $d_{1,4} = 3$.

Figure 4 Set of $h_{rs}$ given by (5) for $0 < D_{rs} < 1$ and $d_{rs} = 0, 1, \ldots, 4$.


a function $h_{rs}$ depending both on the distance $d_{rs}$ and on another distance $D_{rs}$
defined in weight space:

$$D_{rs} = \sum_{j=r+1}^{s} |w_j - w_{j-1}|$$


If  we  define 


(5) 


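when $\sigma \to 0$, we get

$$h_{s \pm 1,s} = \left[ 1 + \frac{D_{s \pm 1,s}}{\sigma} \right]^{-1} \approx \frac{\sigma}{D_{s \pm 1,s}}$$

and substituting this expression in (4) we obtain

$$V(W) \approx \frac{\sigma}{2N} \sum_{s} \left[ \frac{(q_{i(s)} - q_{i(s-1)})^2}{|q_{i(s)} - q_{i(s-1)}|} + \frac{(q_{i(s)} - q_{i(s+1)})^2}{|q_{i(s)} - q_{i(s+1)}|} \right] = \frac{\sigma}{N} L_{TSP}$$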

With this choice of $h_{rs}$ the minimization of the energy $V(W)$ is equivalent to the
minimization of the TSP path. We remark that the introduction of the distance
$D_{rs}$ between weights is a slightly unusual hypothesis for this kind of net, which
usually keeps neuron and weight spaces well separated in the sense that the distance
function $h_{rs}$ depends only on the distance $d_{rs}$.
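The two distances that enter the new distance function can be computed side by side. The sketch below assumes, as Figure 3 suggests, that $D_{rs}$ is the sum of the distances between consecutive weights from neuron $r$ to neuron $s$ and that $d_{rs}$ counts the steps between them; the function names are hypothetical and the sketch covers only the case $r \le s$ without wrapping around the ring.

```python
import math

def neuron_distance(r, s):
    """d_rs: number of steps along the ring from neuron r to neuron s (r <= s)."""
    return s - r

def weight_space_distance(weights, r, s):
    """D_rs: length of the polyline w_r, w_{r+1}, ..., w_s in weight space (r <= s)."""
    total = 0.0
    for j in range(r + 1, s + 1):
        total += math.hypot(weights[j][0] - weights[j - 1][0],
                            weights[j][1] - weights[j - 1][1])
    return total
```

Unlike $d_{rs}$, the value of $D_{rs}$ changes as the weights move during learning, which is what couples the distance function to weight space.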


4  Numerical  Results 

Since the performance of this kind of TSP algorithm is good for problems with
more than 500 cities and more critical in smaller problems [Budinich 1995], we began
by testing the performance produced by the new distance function (5) in problems
with 50 cities.
City set   Min. length   [Durbin 1987]   [Budinich 1995]   This algorithm
1          5.8358        2.47%           1.65%             0.96%
2          5.9945        0.59%           1.66%
3          5.5749        2.24%           1.06%
4          5.6978        2.85%           1.37%             0.70%
5          6.1673        5.23%           5.25%             0.43%
Average                  2.68%           2.20%             0.69%

Table 1 Comparison of the best TSP solution obtained in 10 runs of the various
algorithms. Rows refer to the 5 different problems, each of 50 cities randomly
distributed in the (0,1) square. Column 2 reports the length of the best known
solution for the given problem. Columns 3 to 5 contain the best lengths obtained
by the three algorithms under study expressed as percentage increments over the
minimal length; the number of runs of the algorithms is respectively: unknown, 5
and 10. The last row gives the increment averaged over the 5 city sets.


We compared the quality of TSP solutions obtained with this net to those of two
other algorithms that both derive from the idea of a topology preserving map and
both actually minimize $L_{TSP^2}$: the elastic net of Durbin and Willshaw [Durbin
1987] and this same algorithm with a standard distance choice.

As  a  test  set,  we  considered  the  very  same  5  sets  of  50  randomly  distributed  cities 
used  for  the  elastic  net. 

Table 1 contains a comparison of the best TSP path obtained in several runs of
the different algorithms, expressed as percentage increments over the best known
solution for the given problem. Another measure of the quality of the solution is
the mean length obtained in the 10 runs. The percentage increment of these mean
lengths, averaged over the 5 sets, was 2.49% for this algorithm, showing that even
the averages found with the new distance function are better than the minima found
with the elastic net.

These results clearly show that distance choice (5) gives better solutions in this
SOFM application, thus supporting the guess that an energy $V(W)$ directly
proportional to the length of the tour $L_{TSP}$ is better tuned to this problem.

One  could  wonder  if  adding  weight  space  information  to  the  distance  function  could 
give  interesting  results  also  in  other  SOFM  applications. 

Appendix 

Here we describe the network setting that produces the quoted numerical results.
Apart from the distance definition (5) we apply a standard Kohonen algorithm and
exponentially decrease parameters $\epsilon$ and $\sigma$ with learning epoch $n_e$ (a learning epoch
corresponds to N weight updates with rule (1))

$$\epsilon = \epsilon_0 \alpha^{n_e} \qquad \sigma = \sigma_0 \beta^{n_e}$$

and learning stops when $\epsilon$ reaches 5 • 10~^. Numerical simulations clearly indicate
that best results are obtained when the final value of $\sigma$ is very small (≈ 5 • 10"^)
and when $\epsilon$ and $\sigma$ decrease together, reaching their final values at the same time.
Consequently, given values for $\alpha$ and $\sigma_0$ one easily finds $\beta$. In other words there are
just three free parameters to play with to optimize results, namely $\epsilon_0$, $\sigma_0$ and $\alpha$.
After some investigation we obtained the following values that produce the quoted
results: $\epsilon_0 = 0.8$, $\sigma_0 = 14$ and $\alpha = 0.9996$.
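The remark that the second decay rate follows once both parameters must reach their final values at the same epoch can be written out explicitly. In this sketch the function name `matching_beta` and its arguments are hypothetical, and the final values of the two parameters are left as inputs rather than fixed.

```python
import math

def matching_beta(eps0, alpha, eps_final, sigma0, sigma_final):
    """Return (n_end, beta) so that sigma reaches sigma_final when eps reaches eps_final.

    Assumes exponential schedules eps = eps0 * alpha**n and sigma = sigma0 * beta**n.
    """
    # Epoch at which eps0 * alpha**n equals eps_final.
    n_end = math.log(eps_final / eps0) / math.log(alpha)
    # Decay rate that brings sigma0 down to sigma_final in the same number of epochs.
    beta = (sigma_final / sigma0) ** (1.0 / n_end)
    return n_end, beta
```

For instance, `matching_beta(0.8, 0.5, 0.1, 8.0, 1.0)` gives three epochs and a decay rate of one half for the width.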
In this paper Artificial Neural Networks are considered as an example of the semiparametric class
of models which has become very popular among statisticians and econometricians in recent years.
Modelling and learning aspects are discussed. Some statistical procedures are described in order
to learn with infinite-dimensional nuisance parameters, and adaptive estimators are presented.

1 Introduction

We look at the interface between statistics and artificial neural networks (ANN)
and study stochastic multilayer neural networks with unspecified functional
components. We call them Semiparametric Neural Networks (SNN). The fact that
no absolute requirements come from biological considerations represents an
important motivation for SNN. Many technical issues arise from the statistical inference
perspective; we (A) stress the importance of computing likelihood-based criterion
functions in order to exploit the large sample statistical properties pertaining to
the related estimators and (B) measure the asymptotic efficiency of the estimators
from given network architectures.

2  ANN  and  Likelihood  Theory 

A stochastic machine with a vector of synaptic weights $w$, the input (output) vectors
$x$ ($y$) (given a conditional density function $f(y \mid x)$) and the data set dimension $T$,
is used to compute $w$ through a learning algorithm based on the minimization of
some fitting criterion function of estimated residuals or prediction errors. Consider
a likelihood function $L(Y, \theta)$, where $Y = (y_1, \ldots, y_T)'$ is the sample observation
vector and $\theta = (\theta_1, \ldots, \theta_p)'$ is a vector of parameters. These three elements can
fully characterize a taxonomy of models, differentiated by the sample size and the
dimension of the parameter space where the optimization of the chosen criterion
function must be done. For instance, in a network where all the activation functions
are specified and the dimension of $\theta$ is $p < \infty$, i.e., in a fully parametric situation,
the solution to the optimization problem is to maximize $l = \ln L(Y, \theta)$ over
the weight space $\Theta$. On the other hand, the activation functions can be reasonably
left unspecified, apart from some smoothness conditions. This choice permits a shift
from a parametric to a more nonparametric setting, where the likelihood function
now admits a modified form $L(Y, \lambda)$, with $\lambda = (w, g, h)$ and $w, g, h$ indicating
respectively the weight vector and the unknown output and hidden layer activation
functions. A more restricted context, but still semiparametric in nature, occurs
when $g$ or $h$ is in fact specified; networks here closely resemble the semiparametric
and nonparametric extensions of the Generalized Linear Models [7] described in
[6]. As a result, the likelihood function is now more constrained, with $L(Y, \delta)$ and
$\delta = (w, g, h)$, where either $g$ or $h$ is specified. If a stochastic one-hidden layer
feedforward neural network is represented by:

$$y = G(x, w) = F\Big[ \gamma_0 + \sum_{j=1}^{q} \alpha_j f\Big( \gamma_j + \sum_{i=1}^{s} v_{ij} x_i \Big) \Big] \qquad (1)$$

where $w$ is the vector of weights and $F$/$f$ are the specified output/hidden layer
activation functions, then a likelihood-based criterion function,
under a gaussian distribution of $p(y \mid x, w)$, is directly related to the sum-of-squared
errors scaled by the output variance. For $F(z) = z$, we have that:

$$\max_w \sum_t \ln p(y_t \mid x_t, w) \iff \min_w \sum_t (y_t - G(x_t, w))^2 \qquad (2)$$
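The step behind (2) can be written out explicitly. The short derivation below is a sketch that assumes independent observations and a constant output variance; the symbol $\sigma^2$ for that variance and the explicit sum limits are introduced here for illustration.

```latex
\begin{align*}
\ln p(y_t \mid x_t, w) &= -\tfrac{1}{2}\ln(2\pi\sigma^2)
    - \frac{(y_t - G(x_t, w))^2}{2\sigma^2}, \\
\sum_{t=1}^{T} \ln p(y_t \mid x_t, w) &= -\tfrac{T}{2}\ln(2\pi\sigma^2)
    - \frac{1}{2\sigma^2} \sum_{t=1}^{T} (y_t - G(x_t, w))^2 .
\end{align*}
```

The first term does not depend on $w$ and the second is the negative sum of squared errors scaled by the variance, so maximizing the log-likelihood over $w$ is the same as minimizing the sum of squared errors.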

Whenever Gaussianity is assumed but not verified, a quasi-maximum likelihood
criterion is given. Conditionally gaussian time series models are specified in terms of
a gaussian distribution for $y_t$, conditioned on the information available at time $t-1$;
here too a likelihood function can be computed and thus (2) holds¹. When the
likelihood function relies on unknown components we deal with an ill-posed optimization
problem and some form of regularization is required. If the entire parameter
vector can be decomposed into “interest” $w$ and “nuisance” $\eta$ components, such that
$\hat\theta = (\hat w, \hat\eta)$ represents the ML estimator of $\theta$, then $\hat\theta(w) = (w, \hat\eta_w)$ is a
constrained initial estimator for $\theta$, i.e., an estimator computed from an initially fixed
value of the parameters of interest, not necessarily a ML value. Inference for $w$ can be
based on the so-called profile or concentrated log-likelihood function $PL(w) = L(\hat\theta(w))$.
We seek the value $w^*$ which maximizes $PL(w)$; it is $w^* = \hat w$, at least for the case
of a finite dimensional nuisance parameter space. In other words, a ML estimate
is computed in two sequential stages: (1) an estimate is obtained for $\eta$, given an
initial consistent value of $w$; (2) the MLE for $w$ conditional on $\hat\eta_w$ is calculated by
maximizing a likelihood-based criterion function where the previous estimate of $\eta$
has been plugged in. Hence the name profile log-likelihood.
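The two-stage profile likelihood idea can be illustrated on a deliberately simple model. The sketch below is not a neural network: it assumes i.i.d. Gaussian data in which the mean plays the role of the interest parameter $w$ and the variance plays the role of the nuisance parameter $\eta$; the function names and the grid search are illustrative choices only.

```python
import math

def profile_loglik(w, ys):
    """Gaussian log-likelihood with the nuisance variance replaced by its estimate given w."""
    t = len(ys)
    # Stage 1: ML estimate of the nuisance parameter for this fixed w.
    eta_hat = sum((y - w) ** 2 for y in ys) / t
    # Plugging eta_hat back in gives the profile (concentrated) log-likelihood.
    return -0.5 * t * (math.log(2.0 * math.pi * eta_hat) + 1.0)

def maximize_profile(ys, grid):
    """Stage 2: choose the grid value of w with the largest profile log-likelihood."""
    return max(grid, key=lambda w: profile_loglik(w, ys))
```

In this toy case the maximizer of the profile log-likelihood coincides with the full ML estimate of the mean, which mirrors the statement $w^* = \hat w$ for a finite dimensional nuisance parameter.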

3  The  Semiparametric  Approach  to  ANN 

The bias/variance dilemma [4] is crucial in ANN as it is in nonparametric statistical
inference. If the relation between the input and the output variables is unknown,
as for the function $E(y_t \mid I_t)$ in nonlinear regression or $E(y_t \mid I_{t-1})$ in time series
prediction problems², ANN will try to exploit their “universal approximation”
properties. But whereas parametric and potentially incorrect models lead to high
bias, nonparametric models lead to high variance, given the presence of parameters
of infinite dimension; therefore, the loss of efficiency is a likely price to pay. We
usually observe noisy input data, according to some underlying probability
distribution; thus, for some aspects, the advantage is that data could be replicated under
certain conditions that offer a solution for the data-size constraint and thus soften

^The  normalized  mean  squared  error  [12]  cost  function  E  =  ~  Vt)^ 

be used. Otherwise, the likelihood can be formed in terms of the prediction errors
$v_t = y_t - E(y_t \mid y_{t-1}, \ldots, y_1)$, given that $var(v_t) = var(y_t \mid y_{t-1}, \ldots, y_1)$. Moreover,
the $w$ value that minimizes the sum of squared errors is equivalent to the NLS estimator [13]
and this converges to the network optimal weights $w^*$ as $n \to \infty$.

² $I_t$ or $I_{t-1}$ represent information up to the time $t$ or $t-1$ (lags only included for $I_{t-1}$).
the “curse of dimensionality”³.

An important aspect in SNN is related to the functional form of the I/O
relation. The approximation ability of the network depends on the characteristics of
the underlying functional relation; usually sigmoidal-type or RBF-type ANN work
well, but when the activation functions are left unspecified, projection pursuit
regression [3] and other highly parameterized techniques represent possible solutions.
We consider pure minimization and iterative estimation strategies; the former are
based on the decomposition of the parameter vector, $\theta = (w, \eta)$, where $w$ represents
the weights and the bias terms considered together and $\eta$ represents the
infinite-dimensional nuisance parameter⁴, and the latter are Newton-Raphson (NR)-type
procedures. We address the optimization problem in the Profile ML setting
introduced before. But another challenging issue is the weight estimation accuracy,
at least asymptotically. By working with an initially unrestricted likelihood-based
performance measure we would like to obtain, in statistical terms, a lower bound
[8, 9, 10] for the asymptotic variance (AV) of the parametric component of the
semiparametric model such that we are able to generalize the usual parametric bound
given by the inverse of the Fisher Information Matrix (FIM). We equivalently
calculate the Semiparametric Efficiency Bounds (SEB) for the parameters of
interest, which quantify the efficiency loss resulting from a semiparametric model
compared to a parametric one⁵.

4  Parametric  and  Nonparametric  Estimation  Theory 

The discussion here draws mainly on [8], [10] and [11], where the concept of marginal
Fisher’s bound for asymptotic variances in parametric models is generalized. If it is
true that a nonparametric problem is at least as difficult as any of the parametric
problems obtained by assuming we have enough knowledge of the unknown state
of nature to restrict it to a finite dimensional set, it is important to look for a
method that obtains a lower bound for the AV of the parametric component of the
initially unrestricted model⁶. Since a one-dimensional sub-problem asymptotically
as difficult as the original multidimensional problem often exists, we could express
the parameter space as a union of these one-dimensional sub-problems (paths or
directions  along  w,  like  Fw  :  ^  F)  and  estimate  the  parameter  of  interest  to 

select one of the sub-problems, proceeding as if the true parameter lay on
this curve. The question is: which sub-problem should be selected? First, we should
verify the existence of an estimator for a “curve” defined in the infinite-dimensional
nuisance parameter space, corresponding to one of the possible one-dimensional

³ These distributional assumptions make a substantial difference in terms of the global or
local statistical efficiency that an estimator can achieve. While for global efficiency we mean
that an estimator is accurate regardless of the true underlying distributions, for local efficiency
the same holds only for some specific distributions related to the nonparametric component of the
model.

⁴ $\eta$ can stand for the unknown hidden layer activation functions, the noise density which
randomly perturbs the input/output pattern data, the weights or even the same activation functions.

⁵ In the ANN context we could make comparisons between SNN and sigmoidal-type or other
networks on the grounds of the statistical efficiency of the related learning algorithms.

⁶ If a sequence of estimators $(\hat\theta_n)$ satisfies $\sqrt{n}(\hat\theta_n - \theta) \to_d N(0, V)$, then $V$ is the
asymptotic variance of $(\hat\theta_n)$ and for the ML estimator, under regular conditions, equals
$I_\theta^{-1}$, where $I_\theta$ is the FIM.
sub-problems. Given 0 = consider the smooth curve $\theta(t)$ : $a < t < b$

designed  in  the  space  x  T  by  the  map  t  — ►  6{t)  =  7?(t)).  According  to  a 

possible  parameterization  inducted  by  w{t)  =  the  map  becomes  w  ^  {w,  rju))  with 
Vwo  =  and  for  each  curve  we  can  compute  the  associated  score  functions  ,Urj 
such  that,  U=.„„=  ■ii;(wo,Vo)  +  §^(m,r}o)(-£vw  U=«,„)  (or  simply 

Uai+U,,)  and  the  information  matrix  Eo(-£;l{w,  ij„)  =  E(Uw  +  U)^,  where 

$U \in span\{U_\eta\}$. Repeating the procedure for all the possible $\eta_w$ we can find the
minimum FIM, which is given by $\inf_U E_0(U_w + U)^2 = E_0(U_w + U^*)^2 = i_w$, where
$U^*$ is the projection of $U_w$ onto $span\{U_\eta\}$. In particular, we define the curve $\eta^*$ for
which this minimum is achieved as the Least Favorable Curve (LFC). With a
semiparametric model, in order for $\eta^*$ to be LFC, the following relation must be
satisfied:


J  S  L([  —  J  |iy=Tyo)  (3) 

for  each  curve  w  ^  rjy;  in  the  nuisance  functional  space  Following  the  parametric 
case,  the  marginal  FIM  for  w  in  a  semiparametric  model  is  given  by: 


iw  =  infrjE{[ 


dl{w,'qy,). 


|iy=iyo) 


or  alternatively  iyj  =  EoiUy,  +  U*),  with  U*  =  where  T  G  T 

represents  the  tangent  vector  to  the  curve  r}yj  at  wq  and  ^  is  a  Frechet  derivative®. 
Now, $i_w^{-1} = V$ is the lower bound for the AV of a regular⁹ estimator of $w$ in the
semiparametric model. In summary, every semiparametric estimator has an AV
comparable to the Cramer-Rao bound of a semiparametric model; therefore, the
bound is not less than the bound of every parametric sub-model, i.e. the sup $V$ of
all these bounds. This is the semiparametric AV bound or SEB, whose attainment
reflects the ability to estimate adaptively.

5  Adaptive  Estimation 

In practice, accurate estimation and unrestricted model specification can be
contrasting aspects of a problem. Adaptive estimation is a solution since it suggests
that full efficiency can still be attained even under an initially unrestricted model¹⁰.
A pure minimization method for adaptively estimating semiparametric models is
the one described in Section 2 and adopted in [10], the Generalized Profile
Likelihood (GPL)¹¹. For a regular estimator the convolution theorem [2] applies,

⁷ In practice [10] for a nonparametric regression or density estimation problem $\hat\eta_w$ often
converges to the LFC; otherwise, we could verify that its limit is such that a least favorable direction
is found.

⁸ We say that $T$ is Fréchet differentiable at $x$ if there exists a function $T'_x : X \to Y$ which is
linear and continuous such that $\lim_{\|h\| \to 0} \|T(x+h) - T(x) - T'_x(h)\| / \|h\| = 0$ (see [2]).

⁹ Given a local DGP, i.e. a stochastic process where for every sample size $n$ the data are
distributed according to $\theta_n$, with $\sqrt{n}(\theta_n - \theta_0)$ bounded, an estimator $\hat w$ is called regular
in a parametric sub-model if, for each true value of the parameter vector, $\sqrt{n}(\hat w - w(\theta_n))$
has a limit distribution not dependent on the local DGP, and regular if the property holds for
every parametric sub-model.

¹⁰ An adaptive estimator can be found to exist or not. In any case, the model is required to
satisfy conditions related more to the underlying structure or functional relations than to the
possible adaptive estimation procedures [9].

¹¹ Generalized because the first step estimator for the nuisance parameter vector is not required
to be an ML estimator.
thus giving an asymptotic distribution for $\sqrt{n}(\hat w - w_0)$ equal to that of $Z + U$, where
$Z \sim N(0, V)$ and $U$ is an independent noise; $V$ is the SEB, as shown when $U$ is
gaussian and thus $Var(\hat w) = V + E(UU') \geq V$. If $\eta_w$ is a curve in $T$, we require that
the estimator $\hat w$ obtained by maximizing $l(w, \hat\eta_w)$ has an asymptotic distribution
as  the  one  of  the  estimator  obtained  by  or  equivalently  we  require  that 

the two functions have the same local behavior; then, the SEB should be attained.
If $\hat w$ is the maximizer of $l(w, \hat\eta_w)$, then $\sqrt{n}(\hat w - w_0) \to_d N(0, V)$ is expected,
because when $\eta_w$ is LFC the bias disappears asymptotically, due to the orthogonality
of the score functions. Thus, under some regularity conditions, an estimator for $w$
can be obtained by maximizing the GPL criterion function and the estimator is
asymptotically efficient. The value of the SEB is given by the inverse of:

•  _  I 

-  L  Q  /  Q  J 

n  ow'ow 

5.1  The  Algorithm  Applied  to  ANN;  an  Example 

Consider the example in [1]: suppose we have a stochastic machine with an input
vector $x$, a p-dimensional weight vector $w$ and the binary output is $y = \pm 1$
according to $p(y \mid x, w)$, where $p(y = 1 \mid x, w) = k[f(x, w)]$ and $p(y = -1 \mid x, w) =
1 - k[f(x, w)]$, $k(f) = 1/(1 + e^{-\beta f})$, $\beta$ is the inverse of the so-called temperature
parameter and $f$ represents a smooth hidden layer activation function (therefore our
nuisance parameters). The general form of the conditional log-likelihood function
$l(y \mid x, w) = \ln p(y \mid x, w)$ is $C = \sum_i y_i \ln[D(w)] + (1 - y_i) \ln[1 - D(w)]$, where
$D(w) = k[f(x_i, w)]$. According to the GPL procedure, we (1) estimate initially $w$
via BP with a network of fixed structure (i.e. $f$ represented with one of the usually
specified forms); (2) obtain a nonparametric estimate $\hat f$ conditionally on the
estimated $w$, thus finding $\hat D$; (3) plug in $\hat D$ and maximize the GPL criterion function
w.r.t. $w$ (where $\hat D$ in fact replaces $D^*$, which represents an estimable and locally
approximating version of the unknown function $D$)¹²; (4) calculate $\hat w$ and repeat steps
2-3-4 until we go as close as possible to the SEB. Thus the stochastic machine
specifies the probabilities $p(y = \pm 1 \mid x, w)$.
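The conditional log-likelihood of such a binary stochastic machine can be evaluated directly once a form for $k$ and $f$ is fixed. The sketch below assumes a logistic $k$ with inverse temperature `beta` and takes $f$ as a user-supplied function; the names `k`, `log_likelihood` and the linear `f` used in the usage line are illustrative assumptions.

```python
import math

def k(f_value, beta=1.0):
    """Assumed logistic response: probability of the output y = +1."""
    return 1.0 / (1.0 + math.exp(-beta * f_value))

def log_likelihood(w, data, f, beta=1.0):
    """Sum of ln p(y | x, w) over (x, y) pairs with y in {+1, -1}."""
    total = 0.0
    for x, y in data:
        p_plus = k(f(x, w), beta)
        total += math.log(p_plus if y == 1 else 1.0 - p_plus)
    return total
```

As a usage example, with `f = lambda x, w: sum(a * b for a, b in zip(x, w))` and all weights equal to zero every output has probability one half, so each observation contributes $\ln 0.5$.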

5.2 Relationships between GPL and Learning Consolidation

When fully iterative algorithms are considered, consolidation of network
learning proposed in [13] becomes important, since BP is only one of the possible
recursive m-estimators that locally solve optimization problems and is asymptotically
inefficient relative to a NR-type estimator, regardless of the local minimizer. Thus,
BP can be improved with regard to the accuracy of the estimate through just one
NR step. In [8] one-step semiparametric m-estimators are considered¹³; they are
based on the zero mean and finite variance influence function (IF) $\psi(z)$ of a single
observation on the estimator, such that $\sqrt{n}(\hat\phi - \phi) = \frac{1}{\sqrt{n}} \sum_i \psi(z_i) + o_p(1)$, and they

^^Note  that  (A)  ly(jD)  that  maximizes  L[D{w))  should  behave  like  iy(Z)),  where  w{D)  = 
argsupwL{D{w))  (a  consistent  estimator  is  required,  given  that  L{D{w)  -^pr  L{D{w))  uni¬ 
formly  in  w)  (B)  given  AV  =  where  according  to  [1]  G  =  (gij)  is  the  FIM  and  gij  = 

is  the  generic  component  of  G,  we  must  compare  AV  with  the 

inverse  of  (5). 

¹³ These estimators are asymptotically efficient after one iteration from a consistent initial
estimate of the parameter of interest and with the functions of the likelihood consistently estimated.
solve moment equations of the form = 0, given a general interest
parameter $\phi$. The m(.) function, similar to the estimating functions adopted in [5],
can represent the First Order Conditions for the maximum of a criterion
function Q, and p maximizes the expected value of the same function. This method is
equivalent  to  GPL  when  f  is  the  density  from  a  given  distribution  function  F,  Q  is 
the  log-likelihood  function,  m(.)  the  score  function  and  77(0,  F)  is  the  limit,  for  the 
nonparametric  component,  that  maximizes  the  expected  value  of  lnf{zl<f^  p).  With 
^  =  argmax^Y^N'^f{zil<l>,p^)  as  the  GP(Max)L  estimator,  where  the  estimation 
of  p  does  not  affect  the  AV,  and  given  M  =  nonsingular,  we 

have  =  M~^m{z,  770)  and  which  is  the  one-step  version 

of  GP(Max)L  estimator. 
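The idea of improving an initial estimate through a single Newton-Raphson step can be shown on a scalar parametric example. This is only a sketch of the generic one-step update, not of the semiparametric estimator itself; the function name `one_step_newton` and the Gaussian-mean example in the usage note are assumptions chosen because the result is easy to check by hand.

```python
def one_step_newton(theta0, score, hessian):
    """One Newton-Raphson step on a log-likelihood, starting from theta0.

    score(theta) is the first derivative of the log-likelihood and
    hessian(theta) its second derivative, both evaluated at theta.
    """
    return theta0 - score(theta0) / hessian(theta0)
```

For a Gaussian mean with unit variance the log-likelihood is quadratic, so with `score = lambda th: sum(y - th for y in ys)` and `hessian = lambda th: -float(len(ys))` a single step from any starting value lands exactly on the sample mean.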

6  Conclusions 

We  analyzed  semiparametric  neural  networks,  described  a  general  model  set-up 
and  discussed  the  related  asymptotic  estimation  issues.  The  degree  of  success  in 
solving  the  bias/variance  dilemma  is  often  a  case-dependent  problem  requiring  a 
reparameterization  of  the  model  in  the  hope  of  finding  adaptive  estimators.  SEB 
tell  us  about  the  asymptotic  statistical  efficiency  of  the  chosen  estimator. 
This paper describes an event-space feedforward network based on partitioning of the input space
using maximum entropy criterion. It shows how primitives defined as partitioned hypercells (event
space) can be selected for the purpose of class discrimination. Class discrimination of a hypercell
is evaluated statistically. Observed primitives corresponding to observed characteristics in selected
hypercells are used as inputs to a feedforward network in classification. Preliminary experimental
results using simulated data and as it pertains to speaker discrimination using low-level speech
data have shown very good classification rates.

1  Introduction 

This paper proposes a feedforward network whose input layer is reconfigured during
the training phase depending on the generation and selection of newly defined primitives.
Because the primitives are identified by partitioning the input outcome
space, the input nodes are constructed as the primitive set is defined.
The primitives are defined through the selection of partitioned
hypercells corresponding to certain feature values of the data selected for classification.
Thus, an observed primitive in a datum corresponds to an observed
selected characteristic (or range of values) which eventually determines the datum’s
classification. Since the input layer in the network is reconfigured depending on the
selected hypercells identified, we call it a self-configurable neural network [2]. The
two processes of primitive generation and classification are integrated and “closely
coupled”.

2  Maximum  Entropy  Partitioning 

When the outcome space is discretized by partitioning the data, the discretization
process becomes a partitioning process. The Maximum Entropy Partitioning
(MEP) process generates a set of hypercells by partitioning the
outcome space of n continuous-valued variables (or features) [1, 3, 4, 7]. Compared to
the commonly used equal-width partitioning algorithm, this method bypasses the
problem of non-uniform scaling of different variables in multivariate data and
minimizes the information loss after partitioning [3].

Given n variables in n-dimensional space, the MEP process partitions a data set
into $k^n$ hypercells based on the estimated probabilities of the data. The value k
represents the number of intervals that each dimension is divided into. Let P be the
probability distribution; the process produces a quantization of P into

$$P(R_i), \quad i = 1, 2, \ldots, k^n, \qquad (1)$$

where $R_i$ denotes a hypercell. To maximize the information represented by the $R_i$, the
partitioning scheme used is the one that maximizes Shannon’s entropy function, defined as

$$H(R) = -\sum_{i=1}^{k^n} P(R_i) \log P(R_i). \qquad (2)$$

The function H(R) now becomes the objective function: information is
maximized by maximizing H(R). With one variable, the maximum
occurs when the expected probability $P(R_i)$ is approximately 1/k of the
training data with repeated observations. This creates narrower intervals where the
probability density is higher [3]. In the proposed method, the partitioning
is based on the marginal probability distribution of each variable to avoid
combinatorial complexity.
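
The one-variable case can be illustrated with a small sketch. It is not the algorithm of [3]; it only assumes that maximum entropy partitioning of a single variable amounts to placing boundaries so that each of the k intervals holds about 1/k of the observations. The function names and the sample values are hypothetical.

```python
import math

def mep_boundaries(values, k):
    """k-1 boundaries giving k intervals with (about) equal numbers of points."""
    ordered = sorted(values)
    n = len(ordered)
    cuts = [j * n // k for j in range(1, k)]
    # Place each boundary midway between the two neighbouring observations.
    return [(ordered[c - 1] + ordered[c]) / 2.0 for c in cuts]

def equal_width_boundaries(values, k):
    """k-1 boundaries splitting the observed range into k equal-width intervals."""
    lo, hi = min(values), max(values)
    return [lo + j * (hi - lo) / k for j in range(1, k)]

def interval_counts(values, boundaries):
    """Number of observations falling into each interval."""
    counts = [0] * (len(boundaries) + 1)
    for x in values:
        counts[sum(1 for b in boundaries if x > b)] += 1
    return counts

def entropy(counts):
    """Shannon entropy of the interval probabilities estimated from counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

# Hypothetical, unevenly scaled sample: dense near zero, sparse further out.
sample = [0.1, 0.2, 0.3, 0.4, 5, 6, 7, 50, 60, 70, 80, 90]
mep_counts = interval_counts(sample, mep_boundaries(sample, 3))
width_counts = interval_counts(sample, equal_width_boundaries(sample, 3))
```

On this sample the equal-frequency boundaries give three intervals of four points each, with the narrowest interval where the points are densest, while equal-width boundaries leave most points in the first interval and yield a lower entropy.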

In our data representation (such as low-level speech data), the jth datum is represented
by a set of I(j) points, denoted by
$Z_j = \{x_i = (x_{1i}, x_{2i}, \ldots, x_{ni}) \mid i = 1, \ldots, I(j)\}$.
That is, $Z_j$ is composed of a set of n-vectors. A set of partition boundaries
is determined by combining all the data into a single set that we call the “data
universe”, denoted by U. Hence, U is defined as the union of all the training data:

$$U = \{Z_j \mid j = 1, \ldots, J\} = Z_1 \cup Z_2 \cup \cdots \cup Z_J. \qquad (3)$$

Each datum is assumed to have an assigned class label $C_m$ from $L_c$ classes, $1 \le m \le L_c$.
In n-dimensional space, the set of hypercells generated after partitioning is
then defined as $R = \{R_i \mid i = 1, 2, \ldots, k^n\}$, where each $R_i$ is bounded by intervals
that partition the data universe. The intervals that bound $R_i$ are composed of the
boundary points, which are determined by an algorithm [3].

3  Selection  of  the  Partitioned  Hypercells 

The representation of an individual datum $Z_j$ is based on the same partitioning scheme
generated from partitioning the data universe. Each hypercell generated by the
partitioning cordons off a set of points in $Z_j$. Since our partitioning is based on the
marginal probability distributions, a hypercell $R_i$ that holds significantly more points
of datum $Z_j$ than expected can be detected using the following statistic, also known
as the standard residual [5]:

$$D(R_i, Z_j) = \frac{\mathrm{obs}(R_i, Z_j) - \mathrm{exp}(R_i, Z_j)}{\sqrt{\mathrm{exp}(R_i, Z_j)}}, \qquad (4)$$

where $\mathrm{exp}(R_i, Z_j)$ is the average number of points expected in a hypercell
$R_i$, calculated as $M(Z_j)/k^n$, $\mathrm{obs}(R_i, Z_j)$ is the number of points of $Z_j$
observed in the same hypercell, and $M(Z_j)$ is the total number of points
in $Z_j$. If the expected value is calculated under the null hypothesis that each
hypercell has equal probability of occurrence, the statistic is approximately normally
distributed with a mean of 0 and a variance of 1, so a hypercell that carries a
significant characteristic of the data can be identified at a chosen confidence level.
In cases where the asymptotic variance differs from 1, an adjustment to the standard
residual is required in order to yield better approximations [5]. The significance of a
hypercell is determined by comparing $D(R_i, Z_j)$ to the tabulated z-value of a predefined
confidence level of the standard normal distribution. That is, the hypercell $R_i$ is
selected as significant based on:

$$\delta(R_i, Z_j) = \begin{cases} 1 & \text{if } D(R_i, Z_j) > z \\ 0 & \text{otherwise} \end{cases} \qquad (5)$$

where z is the tabulated z-value for the chosen confidence level.
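
As a worked illustration of the standard residual test, the following sketch computes the statistic for one hypercell and applies the threshold. The function names, the example counts and the default z = 1.645 (the one-sided 95% point of the standard normal distribution) are assumptions made for this example, not values taken from the experiments.

```python
import math

def standard_residual(obs, total_points, k, n):
    """(obs - exp) / sqrt(exp), with exp = total_points / k**n."""
    expected = total_points / (k ** n)
    return (obs - expected) / math.sqrt(expected)

def is_significant(obs, total_points, k, n, z=1.645):
    """1 if the hypercell holds significantly more points than expected, else 0."""
    return 1 if standard_residual(obs, total_points, k, n) > z else 0

# Hypothetical datum: 900 points, k = 3 intervals per dimension, n = 2 dimensions,
# so each of the 9 hypercells is expected to hold 100 points.
d_high = standard_residual(130, 900, 3, 2)   # (130 - 100) / 10
d_low = standard_residual(105, 900, 3, 2)    # (105 - 100) / 10
```

With these numbers a hypercell holding 130 points is flagged as significant, while one holding 105 points is not.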

As each datum is partitioned using the same scheme generated from partitioning the
data universe, the number of data $Z_j$ with significant $D(R_i, Z_j)$ (that is, $\delta(R_i, Z_j) = 1$)
and class label $C_m$ is denoted by $\eta(R_i, C_m)$, or
$\eta(R_i, C_m) = \sum_{Z_j \in C_m} \delta(R_i, Z_j)$. Let $e(R_i)$ be the average number of
data per class whose hypercell $R_i$ is significant, or:

$$e(R_i) = \frac{\sum_{m=1}^{L_c} \eta(R_i, C_m)}{L_c} \qquad (6)$$

and let

$$\theta(R_i) = \sum_{m=1}^{L_c} \frac{[\eta(R_i, C_m) - e(R_i)]^2}{e(R_i)} \qquad (7)$$

Then $\theta(R_i)$ reflects the extent of class discrimination for a hypercell $R_i$ in the data universe.
Since $\theta(R_i)$ has an asymptotic Chi-square distribution, the relevance of a hypercell
can be evaluated by applying the Chi-square test. After $\theta(R_i)$ is calculated, it
is compared to a tabulated value with $L_c - 1$ degrees of freedom based on a
presumed confidence level. The function described in equation (8) indicates the
hypercell’s statistical relevance for class discrimination:

$$\beta(R_i) = \begin{cases} 1 & \text{if } \theta(R_i) > \chi^2_{L_c - 1} \\ 0 & \text{otherwise} \end{cases} \qquad (8)$$
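
The relevance test can be sketched numerically as follows. The per-class counts are made up for illustration, the function names are hypothetical, and the critical value 5.991 is the tabulated Chi-square value for 2 degrees of freedom at the 95% level (three classes).

```python
def class_discrimination(counts):
    """Sum over classes of (eta_m - e)^2 / e, where e is the mean count per class."""
    e = sum(counts) / len(counts)
    if e == 0:
        return 0.0  # the hypercell is never significant, so it carries no information
    return sum((c - e) ** 2 for c in counts) / e

def is_relevant(counts, critical_value):
    """1 if the hypercell discriminates between classes at the chosen level, else 0."""
    return 1 if class_discrimination(counts) > critical_value else 0

CHI2_95_DF2 = 5.991  # tabulated value, 2 degrees of freedom, 95% confidence

# Number of data per class in which the hypercell was found significant.
skewed = [12, 2, 1]   # mostly one class: e = 5, statistic = 74 / 5 = 14.8
even = [5, 4, 6]      # spread evenly:    e = 5, statistic = 2 / 5 = 0.4
```

A hypercell that is significant mainly in data of one class exceeds the critical value and is kept; one that is significant about equally often in every class is not.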


Hypercells that are not identified as statistically relevant are partitioned further,
using the same criterion of maximum entropy, until too few points remain
or a predefined depth has been reached. We call this method hierarchical
maximum entropy partitioning [1, 3]. The rationale for partitioning
iteratively is to identify useful characteristics within a more restricted interval when
no relevant characteristic is found in a larger interval. The hypercells that surpass the
threshold value are marked as having an acceptable degree of class discrimination
with $\beta(R_i) = 1$, and these selected hypercells are relabeled as $\{R^*_s\}$, as indicated
by the superscript. The set $\{R^*_s\}$ corresponds to the set of input nodes in the
feedforward network. When labeled as $(R^*_1, R^*_2, \ldots, R^*_S)$, they correspond to a set of
data value characteristics that are selected to have acceptable class discrimination
information.


4  Self  Configurable  Event-Space  Feedforward  Network 

The number of inputs to the feedforward network depends on the number of iterations
and selected hypercells. As more iterations of the partitioning process are
performed, more hypercells are generated and identified as statistically relevant.
This is analogous to the use of more refined characteristics of the data for
class discrimination if sampling reliability is not a concern.

Consider a datum $Z_j$ and a selected hypercell $R^*_s$. Let $\mathrm{obs}(R^*_s, Z_j)$ be the number
of points $Z_j$ has in $R^*_s$, and let $\mathrm{exp}(R^*_s, Z_j) = M(Z_j)/k^n$ be the expected average number of
points, where $M(Z_j)$ is the total number of points in $Z_j$. Substituting $\mathrm{obs}(R^*_s, Z_j)$
and $\mathrm{exp}(R^*_s, Z_j)$ into equation (4), we calculate the statistic for $Z_j$, denoted
$D(R^*_s, Z_j)$. Define a binary decision function for all the selected hypercells $\{R^*_s\}$
for $Z_j$ as:

$$\alpha(R^*_s, Z_j) = \begin{cases} 1 & \text{if } D(R^*_s, Z_j) > z \\ 0 & \text{otherwise} \end{cases} \qquad (9)$$

where z is a tabulated z-value for a given confidence level. Then $Z_j$ is represented
by a vector:

$$W_j = (\alpha(R^*_1, Z_j)\theta(R^*_1), \alpha(R^*_2, Z_j)\theta(R^*_2), \ldots, \alpha(R^*_S, Z_j)\theta(R^*_S)) \qquad (10)$$

where $\alpha(R^*_s, Z_j)$ indicates whether a selected characteristic is observed in $Z_j$, that
is, whether a selected primitive is observed, and $\theta(R^*_s)$ is the estimated relevance of
$R^*_s$ based on the analysis of the data universe.

A binary vector $V_j$ can be used as an approximation to $W_j$ for each datum where
simplicity of inputs is desired. It is defined as:

$$V_j = (\alpha(R^*_1, Z_j)\beta(R^*_1), \alpha(R^*_2, Z_j)\beta(R^*_2), \ldots, \alpha(R^*_S, Z_j)\beta(R^*_S)) \qquad (11)$$

Each component $\alpha(R^*_s, Z_j)\beta(R^*_s)$ is the product of $\alpha(R^*_s, Z_j)$, which identifies
significant characteristics in the datum $Z_j$, and $\beta(R^*_s)$, which identifies the hypercells
(or primitives) based on the data universe. In other words, a component is 1 only
if the primitive is statistically significant in $Z_j$ and statistically discriminating in
U.

This approximation does not convey the detailed contribution of each primitive to
class discrimination, as the $W_j$ vector does. However, $V_j$ usually provides faster
training times with additional analysis information. $\beta(R^*_s)$ is a binary
element in the vector $V_j$ and is always equal to 1 for the selected hypercells. Since the
$\beta(R^*_s)$ value replaces $\theta(R^*_s)$ of equation (10) by 1,
$V_j$ is rewritten as:

$$V_j = (\alpha(R^*_1, Z_j), \alpha(R^*_2, Z_j), \ldots, \alpha(R^*_S, Z_j)) \qquad (12)$$
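
A minimal sketch of how the two input vectors could be assembled for one datum is given below. It assumes the observed counts in the selected hypercells and their relevance values are already available; the function name, the example numbers and the default z = 1.645 are hypothetical.

```python
import math

def input_vectors(obs_counts, total_points, k, n, relevances, z=1.645):
    """Return (W, V) for one datum.

    obs_counts[s]  points of the datum observed in selected hypercell s
    relevances[s]  relevance of hypercell s estimated on the data universe
    """
    expected = total_points / (k ** n)
    # Binary indicator per selected hypercell: is the primitive observed?
    alphas = [1 if (obs - expected) / math.sqrt(expected) > z else 0
              for obs in obs_counts]
    w = [a * t for a, t in zip(alphas, relevances)]  # weighted vector
    v = list(alphas)                                 # binary approximation
    return w, v

# Hypothetical datum with 900 points, k = 3, n = 2 (100 points expected per cell)
# and three selected hypercells.
w_vec, v_vec = input_vectors([130, 105, 160], 900, 3, 2, [14.8, 9.2, 7.5])
```

Here the first and third primitives are observed in the datum, so the binary vector marks them with 1 and the weighted vector carries their relevance values; the second component is zero in both.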

5  Training  and  Classification  of  the  Network 

A  network  can  be  trained  using  the  standard  back-propagation  algorithm  with  the 
supervised  class  label  for  each  datum  where  the  vector  Wj  or  Vj  is  the  input.  In  the 
testing  phase,  a  new  datum  Zj  with  an  unknown  class  label  is  assumed  to  belong  to 
one  of  the  given  classes,  hence  it  is  expected  that  the  partitioning  scheme  generated 
from  the  training  session  can  be  applied  as  well.  Zj  is  partitioned  to  the  predefined 
levels  according  to  the  same  scheme  identified  in  training  on  the  data  universe. 
Then  Zj  is  converted  to  the  corresponding  vectors  Wj  or  Vj  using  equation  (10)  or 
(12). 

These vectors are applied to the feedforward network. The output node with the
highest activation, provided it surpasses a threshold value, identifies the class of the
unknown input datum. If no output node has an activation surpassing the
threshold, the datum remains unclassified (or rejected).
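
The decision rule with rejection can be written compactly. The sketch below assumes the output activations are available as a list indexed by class; the function name and the threshold of 0.5 used in the example are illustrative choices only.

```python
def classify(activations, threshold):
    """Index of the most active output node, or None if the datum is rejected."""
    best = max(range(len(activations)), key=lambda i: activations[i])
    return best if activations[best] > threshold else None

accepted = classify([0.1, 0.8, 0.3], 0.5)   # node 1 exceeds the threshold
rejected = classify([0.2, 0.4, 0.3], 0.5)   # no node exceeds the threshold
```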

6  Experimentation  with  Speech  Data 

To show how the algorithm performs on low-level speech data, we performed an
experiment on its ability to identify relevant speech characteristics as well as its
ability to distinguish speaker identity. The data used for these experiments
involved 3 speakers and 3 words: “quit”, “printer” and “end”. Three
classes of data were defined, one corresponding to each speaker. Each of the three
words was spoken 10 times by each speaker, resulting in 30 speech samples per
speaker and a total of 90 samples. Each speech sample is represented by a set of
points, each a 3-vector (time, frequency, amplitude). A
single utterance generated slightly over 12,000 data points.

The experiment used the “hold-out test” to evaluate the algorithm for speaker
identification. Each run consisted of randomly selecting 5 samples of a word from
each speaker class for training. The remaining 5 samples were used for testing. With
3 speakers, each run consists of 15 training data and 15 test data. With 3 words
and 10 runs on each word, a total of 30 runs was performed, giving a total of 450
test samples. The results from the experiments showed that the system performed
reasonably well. A total of 369 out of 450 were classified correctly. Of those that
were not classified correctly, about half were rejected. The total success rate was about 82%.

7  Experimentation  with  Classifying  Data  Forming  Interlocking 
Spirals 

These experiments illustrate the algorithm on classifying overlapping interlocking
spirals of points [6]. In this set of experiments, the data are generated using different
parameters to define two classes of data, so that each datum consists of a number of
points forming a spiral. The points in a spiral were artificially generated so
that they could overlap with points in another datum even though
the two data may belong to different classes. Thus the classification of each datum has to depend
on a large number of points jointly. Each data sample was composed of 96 points.
To create a more difficult problem with probabilistic uncertainty, a degree of
randomness was introduced so that each spiral, as well as each point in it, had its
radius shifted by a random quantity.

In total, 60 data samples were generated for use in all the experiments. The experiment
consisted of 10 runs, where each run used 30 training data, 15
per class. The test set then used the remaining 30 unchosen samples, once again
15 per class. In all, 89 runs were performed using different confidence levels and numbers
of intervals. The results, based on a total of 2670 test samples, were: correctly recognized 92.9%,
rejected 5.0% and incorrectly classified 2.1%.

The aim of this paper is to examine the application of the radial basis function (RBF) network to
realise the decision function of a symbol-decision equaliser for digital communication systems. The
paper first studies the Bayesian equaliser’s decision function to show that the decision function
is nonlinear and has a structure identical to the RBF model. Implementing the full Bayesian
equaliser using an RBF network, however, requires very large complexity, which is not feasible for
practical applications. To reduce the implementation complexity, we propose a model selection
technique to choose the important centres of the RBF equaliser. Our results indicate that a
reduced-size RBF equaliser can be found with no significant degradation in performance if the subset
models are selected appropriately.

Keywords: RBF network, Bayesian equaliser, neural networks.


1  Introduction 

The transmission of digital signals across a communication channel is subject
to noise and intersymbol interference (ISI). At the receiver, these effects must be
compensated to achieve reliable data communications [1, 2]. The channel, consisting
of the transmission filter, transmission medium and receiver filter, is modelled as a
finite impulse response (FIR) filter with transfer function $H(z) = \sum_{i=0}^{n_a-1} a(i) z^{-i}$.

The effect of the channel on the randomly transmitted signal $s(k) \in \{\pm 1\}$
is described by

$$r(k) = \hat{r}(k) + n(k) = \sum_{i=0}^{n_a-1} a(i)\, s(k-i) + n(k) \qquad (1)$$

where r(k) is the corrupted version of s(k) received by the equaliser at
sample instant k, $\hat{r}(k)$ is the noise-free observed signal, n(k) is the additive Gaussian
white noise, a(i) are the channel impulse response coefficients, and $n_a$ is
the channel’s memory length [1, 2]. Using a vector of the noisy received signal
$\mathbf{r}(k) = [r(k), \ldots, r(k-m+1)]^T$, the equaliser’s task is to reconstruct the transmitted
symbol $s(k-d)$ with the minimum probability of misclassification, $P_e$.
The integers m and d are known as the feedforward and delay orders respectively.
The equaliser’s performance measure $P_e$, more commonly expressed in the communication
literature as the bit error rate (BER), BER = $\log_{10} P_e$ [1], is stated
with respect to the signal to noise ratio (SNR), where the SNR is defined by

$$\mathrm{SNR} = \frac{E[\hat{r}^2(k)]}{E[n^2(k)]} = \frac{\sigma_s^2 \sum_{i=0}^{n_a-1} a(i)^2}{\sigma_e^2} \qquad (2)$$

where $\sigma_s^2 = 1$ is the transmit symbol variance and $\sigma_e^2$ is the noise variance.
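
The channel model can be simulated with a few lines of code. The sketch below is illustrative only: the function names are hypothetical, the SNR is assumed to be quoted in decibels as 10 log10 of the power ratio, and symbols before the start of the sequence are assumed to be zero.

```python
import math
import random

def noise_std_for_snr(taps, snr_db, symbol_var=1.0):
    """Noise standard deviation that gives the requested SNR for this channel."""
    signal_power = symbol_var * sum(a * a for a in taps)
    return math.sqrt(signal_power / (10.0 ** (snr_db / 10.0)))

def channel_output(symbols, taps, noise_std, rng):
    """r(k) = sum_i a(i) s(k - i) + n(k), with s(k) = 0 assumed for k < 0."""
    received = []
    for k in range(len(symbols)):
        clean = sum(a * symbols[k - i] for i, a in enumerate(taps) if k - i >= 0)
        received.append(clean + rng.gauss(0.0, noise_std))
    return received

# Noise-free check on the two-tap channel 0.5 + 1.0 z^-1.
clean_output = channel_output([1, -1, 1], [0.5, 1.0], 0.0, random.Random(0))

# Noisy observation of a random +/-1 sequence at 16 dB.
rng = random.Random(1)
symbols = [rng.choice((1, -1)) for _ in range(8)]
noisy_output = channel_output(symbols, [0.5, 1.0],
                              noise_std_for_snr([0.5, 1.0], 16.0), rng)
```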

The transmitted symbols that affect the input vector $\mathbf{r}(k)$ form the transmit sequence
$\mathbf{s}(k) = [s(k), \ldots, s(k-m-n_a+2)]^T$. There are $N_s = 2^{m+n_a-1}$ possible combinations
of these input sequences, i.e. $\{\mathbf{s}_j\}$, $1 \le j \le N_s$. In the absence of noise, there are
$N_s$ corresponding received sequences $\hat{\mathbf{r}}_j(k)$, $1 \le j \le N_s$, which are referred to as
channel states. The values of the channel states are defined by

$$\mathbf{c}_j = \hat{\mathbf{r}}_j(k) = F[\mathbf{s}_j], \quad 1 \le j \le N_s, \qquad (3)$$

where the matrix $F \in R^{m \times (m+n_a-1)}$ is

$$F = \begin{bmatrix}
a(0) & a(1) & \cdots & a(n_a-1) & 0 & \cdots & 0 \\
0 & a(0) & a(1) & \cdots & a(n_a-1) & \cdots & 0 \\
\vdots & & \ddots & & & \ddots & \vdots \\
0 & \cdots & 0 & a(0) & a(1) & \cdots & a(n_a-1)
\end{bmatrix} \qquad (4)$$
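
To make the construction concrete, the following sketch builds the matrix F and enumerates the channel states for a small channel. The function names are hypothetical, and the order in which the sequences are enumerated is an arbitrary choice of this example rather than the indexing used in the paper.

```python
from itertools import product

def channel_matrix(taps, m):
    """m x (m + n_a - 1) matrix whose rows are shifted copies of the channel taps."""
    n_a = len(taps)
    cols = m + n_a - 1
    return [[taps[c - row] if 0 <= c - row < n_a else 0.0 for c in range(cols)]
            for row in range(m)]

def channel_states(taps, m):
    """List of (transmit sequence, noise-free state) pairs for all +/-1 sequences."""
    F = channel_matrix(taps, m)
    states = []
    for s in product((1, -1), repeat=len(F[0])):
        c = tuple(sum(f * x for f, x in zip(row, s)) for row in F)
        states.append((s, c))
    return states

# Two-tap channel 0.5 + 1.0 z^-1 with feedforward order m = 2: 2^3 = 8 states.
states = channel_states([0.5, 1.0], 2)
```

For this channel the sequence (1, 1, 1) maps to the state (1.5, 1.5), and all eight states are distinct points in the plane.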

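Due to the additive noise, the observed sequence $\mathbf{r}(k)$ conditioned on the channel
state $\hat{\mathbf{r}}(k) = \mathbf{c}_j$ has a multivariate Gaussian distribution with mean $\mathbf{c}_j$,

$$p(\mathbf{r}(k) \mid \mathbf{c}_j) = (2\pi\sigma_e^2)^{-m/2} \exp\left(-\|\mathbf{r}(k) - \mathbf{c}_j\|^2 / (2\sigma_e^2)\right). \qquad (5)$$

The set of channel states $C_d = \{\mathbf{c}_j\}_{j=1}^{N_s}$ can be divided into two subsets according
to the value of $s(k-d)$, i.e.

$$C_d^{(+)} = \{\hat{\mathbf{r}}(k) \mid s(k-d) = +1\}, \qquad (6)$$

$$C_d^{(-)} = \{\hat{\mathbf{r}}(k) \mid s(k-d) = -1\}, \qquad (7)$$

where the subscript d in $C_d$ denotes the equaliser’s delay order applied.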

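To minimise the probability of a wrong decision, the optimum decision function is
based on determining the maximum a posteriori probability $P(s(k-d) = s \mid \mathbf{r}(k))$ [2]
given the observed vector $\mathbf{r}(k)$, i.e.,

$$\hat{s}(k-d) = \mathrm{sgn}\left( P(s(k-d) = +1 \mid \mathbf{r}(k)) - P(s(k-d) = -1 \mid \mathbf{r}(k)) \right) \qquad (8)$$

where $\hat{s}(k-d)$ is the estimated value of $s(k-d)$. It has been shown in [2] that the
Bayesian decision function can be reduced to the following form,

$$f_b(\mathbf{r}(k)) = \sum_{\mathbf{c}_j \in C_d^{(+)}} \exp\left(-\|\mathbf{r}(k) - \mathbf{c}_j\|^2 / (2\sigma_e^2)\right) - \sum_{\mathbf{c}_k \in C_d^{(-)}} \exp\left(-\|\mathbf{r}(k) - \mathbf{c}_k\|^2 / (2\sigma_e^2)\right) \qquad (9)$$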

It is therefore obvious that $f_b(\cdot)$ has the same functional form as the RBF model [2,
3] $f_{rbf}(\mathbf{r})$,

$$f_{rbf}(\mathbf{r}) = \sum_{i=1}^{N} w_i\, \phi\left(\|\mathbf{r} - \mathbf{c}_i\|^2 / \alpha\right) \qquad (10)$$

where N is the number of centres, $w_i$ are the feedforward weights, $\phi(\cdot)$ is the
nonlinearity, $\mathbf{c}_i$ are the centres of the RBF model, and $\alpha$ is a constant. The RBF
network is therefore ideal for modelling the optimal Bayesian equaliser [2].
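
The Bayesian decision function can be evaluated directly by enumerating the channel states, as in the following sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the states are generated by brute force, and a tie at zero is arbitrarily resolved as +1.

```python
import math
from itertools import product

def bayesian_decision(r, taps, d, noise_var):
    """Sum of Gaussian kernels over all channel states, signed by s(k - d)."""
    m, n_a = len(r), len(taps)
    total = 0.0
    for s in product((1, -1), repeat=m + n_a - 1):
        # s[0] is s(k), s[1] is s(k-1), ...; row `row` of the state is r_hat(k - row).
        c = [sum(taps[i] * s[row + i] for i in range(n_a)) for row in range(m)]
        dist2 = sum((ri - ci) ** 2 for ri, ci in zip(r, c))
        total += s[d] * math.exp(-dist2 / (2.0 * noise_var))
    return total

def equalise(r, taps, d, noise_var):
    """Estimated symbol s(k - d): the sign of the decision function."""
    return 1 if bayesian_decision(r, taps, d, noise_var) >= 0 else -1

# A received vector close to the state produced by s(k) = s(k-1) = s(k-2) = +1
# on the channel 0.5 + 1.0 z^-1 (state (1.5, 1.5)).
decision = equalise([1.4, 1.6], [0.5, 1.0], 1, 0.01)
```

Each kernel here has unit weight and the same width, which is the special case of the RBF form with weights of +1 or -1 and centres placed at the channel states.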

Example of decision boundary: As an example, consider the Bayesian decision
boundaries realised by an RBF equaliser with feedforward order m = 2 for the channel
$H(z) = 0.5 + 1.0 z^{-1}$. Fig. 1a lists all 8 possible combinations of the transmitted
signal sequence $\mathbf{s}(k)$ and the corresponding channel states $\mathbf{c}_j$. Fig. 1b depicts
the corresponding decision boundaries for the different delay orders. Note that the
decision boundary depends on the channel state positions and the delay order
parameter.

Figure 1 (a) Transmit sequences and channel states for channel H(z).
(b) Corresponding Bayesian decision boundaries for various delay orders.


2  Selecting  Subset  RBF  Model 

The implementation of the full RBF Bayesian equaliser requires the use of all $N_s$
channel states. Such an implementation, however, may be impractical if $N_s$ is large. In
some cases, the complexity may be reduced by using a subset of the $N_s$ channel
states to generate the RBF decision function. For example, the
decision boundary using delay d = 1 for H(z) (Fig. 1b) can be realised approximately
by an RBF network using $\{c_3, c_4, c_5, c_6\}$ as centres. If the decision boundary realised
by the subset RBF equaliser is very similar to that of the full Bayesian equaliser, the
classification performance of the two equalisers will also be very similar. That is,
the implementation complexity of the RBF equaliser is reduced by using only the
important channel states that define the decision boundary.

To understand how centres affect the decision boundary, we analyse the effect of centre
positions on the boundary position when $\sigma_e \to 0$. Let $\{\mathbf{r}_0\}$ be the set of all boundary
points, i.e., the points where $f_b(\mathbf{r}_0)$ equals 0. Therefore, if $\mathbf{r}(k) \in \{\mathbf{r}_0\}$, Eq. 9 becomes

$$\sum_{\mathbf{c}_j \in C_d^{(+)}} \exp\left(-\|\mathbf{r}_0 - \mathbf{c}_j\|^2 / (2\sigma_e^2)\right) = \sum_{\mathbf{c}_k \in C_d^{(-)}} \exp\left(-\|\mathbf{r}_0 - \mathbf{c}_k\|^2 / (2\sigma_e^2)\right). \qquad (11)$$

When $\sigma_e \to 0$, the sum on the l.h.s. of Eq. 11 becomes dominated by the centres
closest to $\mathbf{r}_0$, i.e.

$$\{U^{+}\} = \arg\min_{\mathbf{c}_j \in C_d^{(+)}} \{\|\mathbf{r}_0 - \mathbf{c}_j\|\}. \qquad (12)$$

This is because the contribution from the terms $\exp(-\|\mathbf{r}_0 - \mathbf{c}_j\|^2 / (2\sigma_e^2))$ for centres
$\mathbf{c}_j \notin U^{+}$ converges much more quickly to zero when $\sigma_e \to 0$ than the terms for centres
belonging to $U^{+}$. Similarly, the sum on the r.h.s. of Eq. 11 becomes dominated by
the terms for the closest centres, which belong to $U^{-}$, where
$\{U^{-}\} = \arg\min_{\mathbf{c}_k \in C_d^{(-)}} \{\|\mathbf{r}_0 - \mathbf{c}_k\|\}$.
At very high SNR, the asymptotic decision boundaries are hyperplanes between
pairs of channel states belonging to $\{U^{+}\}$ and $\{U^{-}\}$ [4].

However, not all channel states of $\{U^{+}, U^{-}\}$ are required to define the decision
boundary. This can be observed from the example illustrated in Fig. 1b for the decision
boundary realised using delay order d = 2. By visual inspection (Fig. 1b), it is
obvious that $\{c_3, c_7\} \in U^{+}$ and $\{c_4, c_8\} \in U^{-}$. The decision boundaries formed using
centres $\{c_3, c_4\}$ and $\{c_7, c_8\}$ are, however, the same. Therefore, in this case, only one
pair of channel states, either $\{c_3, c_4\}$ or $\{c_7, c_8\}$, is sufficient to approximate that
region of the decision boundary.

To find the set of important centres $\{U_{ds}^{+}, U_{ds}^{-}\}$ for the subset RBF equaliser, we
propose the following algorithm.


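Algorithm 1: Finding $\{U_{ds}^{+}, U_{ds}^{-}\}$

For $\mathbf{c}_j \in C_d^{(+)}$
    For $\mathbf{c}_k \in C_d^{(-)}$
        if $f_b(\mathbf{r}_0) = 0$ and
        if $\mathbf{c}_j = \arg\min_{\mathbf{c}_i \in C_d^{(+)}} \{\|\mathbf{r}_0 - \mathbf{c}_i\|\}$ and
        $f_s(\mathbf{r}_0) \ne 0$
            $\mathbf{c}_j \to U_{ds}^{+}$, $\mathbf{c}_k \to U_{ds}^{-}$
    next $\mathbf{c}_k$,
next $\mathbf{c}_j$.

where $f_s(\cdot)$ is the RBF model formed using the currently selected channel states from
$\{U_{ds}^{+}, U_{ds}^{-}\}$ as centres and $f_b(\cdot)$ is the full RBF Bayesian equaliser’s decision function.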
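
The idea behind the selection can be illustrated with a deliberately simplified sketch. It is not Algorithm 1 itself: it assumes that the candidate boundary point for a pair of states is their midpoint, it keeps only the nearest-centre test, and it omits the checks involving the full and subset decision functions. All function names are hypothetical.

```python
from itertools import product

def split_states(taps, m, d):
    """Channel states grouped by the sign of s(k - d)."""
    n_a = len(taps)
    plus, minus = [], []
    for s in product((1, -1), repeat=m + n_a - 1):
        c = tuple(sum(taps[i] * s[row + i] for i in range(n_a)) for row in range(m))
        (plus if s[d] == 1 else minus).append(c)
    return plus, minus

def dist2(p, q):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def select_centres(plus, minus, tol=1e-12):
    """Keep pairs whose midpoint has both members as its nearest states."""
    sel_plus, sel_minus = set(), set()
    for cj in plus:
        for ck in minus:
            r0 = tuple((a + b) / 2.0 for a, b in zip(cj, ck))  # assumed boundary point
            nearest_plus = min(dist2(r0, c) for c in plus)
            nearest_minus = min(dist2(r0, c) for c in minus)
            if (dist2(r0, cj) <= nearest_plus + tol
                    and dist2(r0, ck) <= nearest_minus + tol):
                sel_plus.add(cj)
                sel_minus.add(ck)
    return sel_plus, sel_minus

# Channel 0.5 + 1.0 z^-1, feedforward order m = 2, delay d = 1.
plus_states, minus_states = split_states([0.5, 1.0], 2, 1)
kept_plus, kept_minus = select_centres(plus_states, minus_states)
```

For this channel and delay the sketch keeps four of the eight states, two on each side of the boundary, which is the same number of centres as in the d = 1 example discussed above.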

2.1  Subset  Model  Selection  :  Some  Simulation  Results 

Simulations were conducted to select subset RBF equalisers from the full model.
The following channels, which have the same magnitude response but different phase
responses, were used:

$$H_1(z) = 0.8745 + 0.4372 z^{-1} - 0.2098 z^{-2} \qquad (13)$$

$$H_2(z) = 0.2620 - 0.6647 z^{-1} - 0.6995 z^{-2} \qquad (14)$$

The feedforward order used was m = 4, resulting in a full model with
$N_s = 2^{m+n_a-1} = 64$ centres. At an SNR of 16 dB, simulations were conducted
to compare the performance of the subset RBF and full RBF equalisers for the two
channels. The results are tabulated in Tables 1a and 1b respectively. The first column
of each table indicates the delay order parameter, the second column indicates
the number of channel states selected to form the subset model, the third and
fourth columns list the BER performance of the two equalisers, and the last column
indicates whether the channel states belonging to the different transmit symbols, i.e.
$C_d^{(+)}$ and $C_d^{(-)}$, are linearly or not linearly separable. Our results show that a reduced-size
RBF equaliser with performance very similar to the full model’s can
usually be found for equalisation problems that are linearly separable.

Delay   Subset   Subset    Full-model   Decision
        Size     log(Pe)   log(Pe)      Boundary
0       56       -4.09     -4.09        Linear Sep.
1       57       -4.14     -4.14        Linear Sep.
2       32       -4.11     -4.12        Linear Sep.
3       32       -4.11     -4.12        Linear Sep.
4       48       -1.91     -1.91        Not-Linear Sep.
5       64       -0.97     -0.97        Not-Linear Sep.

Table 1a: Channel H1(z)

Delay   Subset   Subset    Full-model   Decision
        Size     log(Pe)   log(Pe)      Boundary
0       56       -0.80     -1.30        Not-Linear Sep.
1       46       -2.99     -2.99        Linear Sep.
2       38       -3.38     -3.38        Linear Sep.
3       56       -3.43     -3.43        Linear Sep.
4       55       -3.32     -3.32        Not-Linear Sep.
5       64       -3.41     -3.41        Not-Linear Sep.

Table 1b: Channel H2(z)

Table 1 Comparing the performance of the full-size (64 centres) RBF equaliser and the
subset RBF equaliser for Channel H1(z) (Table 1a) and Channel H2(z) (Table 1b)
at SNR = 16 dB.


3  Conclusions 

This paper examined the application of the RBF network to channel equalisation. It
was shown that the optimum symbol-decision equaliser can be realised by an RBF
model if the channel statistics are known. The computational complexity required to
implement the full Bayesian function using the RBF network is, however, considerable.
To reduce implementation complexity, a method of model selection that reduces the
number of centres in the RBF model was proposed. Our results indicate that the
model size, and hence the implementation complexity, can in some cases be reduced
without significantly compromising classification performance.

This paper presents applications of graph theory to the design of graph matching neural networks
for automated fingerprint identification. Given a sparse set of minutiae from a fingerprint image,
complete with locations in the plane and (optionally) other labels such as ridge angles, ridge
counts to nearby minutiae and so on, this approach to matching begins by constructing a graph-like
representation of the minutiae map, utilizing proximity graphs such as the sphere-of-influence
graphs. These graph representations are more robust to noise such as translations, rotations and
deformations. This paper presents the role of these graph representations in the design of graph
matching neural networks for the matching and classification of fingerprint images.

1  Introduction 

Matching the representations of two images has been the focus of extensive research
in computer vision and artificial intelligence. In particular, the problem of matching
fingerprint images has received wide attention and varying approaches to its solution.
In this paper we present results of an ongoing collaborative research program
which combines techniques and methods from graph theory and neural science to
design algorithms for graph matching neural networks.

This collaborative approach with Eric Mjolsness, University of Southern California,
San Diego, and Anand Rangarajan, Center for Theoretical and Applied Neural
Science at Yale, is an outgrowth of an initial investigation for the Federal Bureau
of Investigation into their existing Integrated Automated Fingerprint Identification
System, IAFIS.

In this research program, algorithms and techniques from discrete mathematics,
graph theory and computer science are combined to develop methods and algorithms
for representing and matching fingerprints in a very large database, such as
the one at the FBI. The Federal Bureau of Investigation and National Institute of
Standards and Technology provided a small database of fingerprints. This database,
together with the software environment at the Center for Theoretical and Applied
Neural Science at Yale University, has provided a test bed for these algorithms.
The following presents the background of the fingerprint problem and this research,
with a special emphasis on the role of proximity graphs in the design of
the graph matching neural networks.

2  Fingerprint  Images,  Minutiae  and  Graph  Representations 
2.1  Minutiae  Maps 

Fingerprint matching and identification dates back to 1901, when it was introduced
by Sir Edward Henry for Scotland Yard. After sorting the fingerprints into classes
such as whorls, loops and arches, matches are made according to comparisons of
minutiae. Minutiae include such indications as ridge endings, islands and bifurcations,
with fingerprints averaging 100 or more per print. Today fingerprints are
initially stored on computer as a raw minutiae map in the form of a list of minutiae
positions and ridge angles in a raster-scan order. In American courts a positive
matching of a dozen minutiae usually suffices for identification. However, for an
average computer to make these dozen matches the process would entail locating
every minutia in both prints and then comparing all ten-thousand-plus possible
pairings of these minutiae. In addition, the minutiae map itself is very non-robust
to likely noise such as translations, rotations, and deformations, which can change
every minutia position. A subtler form of noise is the gradual increase in ridge
width or image scale typically encountered in moving from the top of the image to
the bottom. Thus, it is desirable to determine graph-like representations which are
more robust to noise and less susceptible to problems with missing minutiae.

2.2  Graph  Representations  of  Minutiae  Maps 

Given  a  sparse  set  of  minutiae  from  one  fingerprint  image,  complete  with  their 
locations  in  the  plane  and  (optionally)  other  labels  such  as  ridge  angles,  ridge 
counts  to  nearby  minutiae  and  so  on,  we  construct  a  graph-  like  representation  of 
the  minutiae  map.  By  considering  relationships  between  pairs  of  minutiae  such  as 
their  geometric  distance  in  the  plane,  or  the  number  of  intervening  ridges  between 
them,  we  can  begin  to  construct  features  which  are  robust  against  translations  and 
rotations at least. However, there still exists the very serious problem of reorderings
of the minutiae forced by rotation and missing or extra minutiae. This problem
must be addressed by defining a reordering-independent match metric between two
such graphs.

Complete  Graphs  and  Planar  Distances 

The simplest example of a labelled graph representation would be the complete
graph where every pair of minutiae is linked by an “edge” in the graph. Edges
would be labelled by the 2-d Euclidean distance between the minutiae. Note that
this graph would require a special definition of the match metric to handle missing,
extra and reordered minutiae. Furthermore, most of the edges would connect
distant minutiae whose relationships, such as planar distance or ridge count, are
subject to noise and provide less real information than edges between nearby
minutiae. So for reasons of robustness and computational cost, it makes sense to
consider instead various kinds of “proximity graphs”, which keep only the edges
between minutiae that are “neighbors” according to some criterion.
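The contrast between the complete graph and a proximity graph can be sketched as follows. This is only an illustration: the minutiae coordinates, the function names and the simple distance threshold used as the "neighbor" criterion are assumptions made for the example, not the criteria studied in this paper.

```python
import math
from itertools import combinations

def complete_graph(points):
    """Label every pair (i, j), i < j, with the planar Euclidean distance."""
    return {(i, j): math.dist(points[i], points[j])
            for i, j in combinations(range(len(points)), 2)}

def prune_by_distance(edges, max_dist):
    """Keep only edges no longer than max_dist (a hypothetical neighbor criterion)."""
    return {e: d for e, d in edges.items() if d <= max_dist}

# Illustrative minutiae positions (assumed, not real fingerprint data).
minutiae = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (0.0, 1.0)]
full = complete_graph(minutiae)                 # n(n-1)/2 edges
near = prune_by_distance(full, max_dist=5.5)    # proximity graph
```

With n minutiae the complete graph carries n(n-1)/2 labelled edges, whereas the pruned graph keeps only the short ones, which is the saving the text refers to.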

Sphere-of-Influence Graphs and Other Proximity Graph Representations

Sphere-of-influence graphs comprise the first set of proximity graphs which we
considered in our goal of determining a better class of minutiae map representations.
First introduced by Toussaint [9], sphere-of-influence graphs provide a potentially
robust representation for minutiae maps. These graphs are capable of capturing
low-level perceptual structures of visual scenes consisting of dot patterns. A very
active group of researchers has developed a series of significant results dealing
with this class of graphs. We refer the reader to the work of M. Lipman [4,5,6]; F.
Harary [5,6]; M. Jacobson [5,6]; T. S. Michael [7,8]; and T. Quint [7,8].

The following definition is given by Toussaint [9]. Let $V = \{p_1, \ldots, p_n\}$
be a finite set of points in the plane. For each point $p$ in $V$, let $r_p$ be the
distance from $p$ to the closest other point of $V$, and let $C_p$ be the circle of
radius $r_p$ centered at $p$. The sphere-of-influence graph, or SIG, is a graph on $V$
with an edge between points $p, q$ if and only if the circles $C_p, C_q$ intersect in
at least two places. For various illustrations of sphere-of-influence graphs we refer
the reader to the excellent presentation by Toussaint in [9].
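A minimal sketch of this definition is given below, assuming distinct points stored as coordinate tuples. The function name and the sample dot pattern are illustrative assumptions. Two circles cross in two places exactly when the distance between their centers lies strictly between the difference and the sum of their radii.

```python
import math

def sphere_of_influence_graph(points):
    """Return (radii, edges) of the SIG on a list of distinct planar points."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # r_p: distance from p to its nearest other point
    radius = [min(dist[i][j] for j in range(n) if j != i) for i in range(n)]
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            d = dist[i][j]
            # circles C_i and C_j meet in two points iff |r_i - r_j| < d < r_i + r_j
            if abs(radius[i] - radius[j]) < d < radius[i] + radius[j]:
                edges.add((i, j))
    return radius, edges

# Illustrative dot pattern: three clustered dots and one distant dot.
dots = [(0.0, 0.0), (1.0, 0.0), (2.0, 1.0), (10.0, 0.0)]
radius, edges = sphere_of_influence_graph(dots)
```

In this pattern the isolated dot has a large radius, so it still links to some of the cluster, while the pair that is farthest apart stays unconnected.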

One can note from these illustrations that perceptually salient groups of dots become
even more distinct in the corresponding sphere-of-influence graph. However,
SIGs represent only one group of an even richer class of graphs we refer to as
proximity graphs. These graphs also offer various benefits in their potential for
providing robust representations of minutiae maps. Proximity graphs all share the
property that they only contain edges between minutiae that are
“neighbors” according to some given criterion. The graphs which have turned out
to be the most promising representations include relative neighborhood graphs,
Delaunay triangulations, Voronoi diagrams, minutiae distance graphs and generalized
sphere-of-influence graphs. We refer the reader to an excellent survey article by
Toussaint [9] for more details on proximity graphs.

Generalized Sphere-of-Influence Graph Representations

K-sphere-of-influence graphs (k-SIGs) are generalizations of SIGs, in which a vertex
is connected to k nearby vertices depending only on their relative distances, not
absolute distances (Guibas, Pach, and Sharir [10]). Given a set $V$ of $n$ points in
$\mathbb{R}^d$, the kth sphere-of-influence of a point $x$ in $V$ is the smallest closed
ball centered at $x$ and containing more than $k$ points of $V$ (including $x$). The
case $k = 1$ yields the standard sphere-of-influence graph. The kth sphere-of-influence
graph, $G_k(V)$, of $V$ is a graph whose vertices are the points of $V$, and two points
are connected by an edge if their kth spheres-of-influence intersect.
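The kth sphere-of-influence graph can be sketched along the same lines. The code below is an illustration under stated assumptions: the function name and the dot pattern are hypothetical, points are distinct, and k is at most n - 1. The smallest closed ball around a point holding more than k points (the point itself included) has radius equal to the distance to its kth nearest other point, and two closed balls intersect when the distance between centers does not exceed the sum of the radii.

```python
import math

def k_sig(points, k):
    """Edge set of the kth sphere-of-influence graph (requires 1 <= k <= n - 1)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # radius of the kth sphere of influence: distance to the kth nearest other point
    radius = [sorted(dist[i][j] for j in range(n) if j != i)[k - 1]
              for i in range(n)]
    return {(i, j)
            for i in range(n) for j in range(i + 1, n)
            if dist[i][j] <= radius[i] + radius[j]}

dots = [(0.0, 0.0), (1.0, 0.0), (2.0, 1.0), (10.0, 0.0)]
g1 = k_sig(dots, 1)   # the standard SIG, up to the treatment of tangent balls
g2 = k_sig(dots, 2)   # larger spheres, hence more edges
```

Because only the rank of the neighbor enters the radius, rescaling all coordinates by a common factor leaves the edge set unchanged.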

K-SIGs  are  especially  suitable  for  minutiae  map  representations  because  of  the  fact 
that  connections  depend  on  relative  distances.  This  property  provides  a  form  of 
scale  invariance.  Each  edge  can  be  labelled  with  the  integer  k,  recording  whether 
it  connects  nearby  (k  =  1)  or  farther  minutiae  pairs.  Unfortunately,  this  scale 
robustness  is  bought  at  the  price  of  increased  susceptibility  to  missing  minutiae. 
When a minutia goes missing, not only is there an unavoidable effect on the graph
by the deletion of the node, but there is a gratuitous “splash” effect on the edges
between nearby pairs of the remaining minutiae: their k-numbers change despite
the fact that their planar distances do not. This effect is mitigated by the match
metric, which changes only gradually with k, but it is still undesirable.

Finally,  we  define  yet  another  graph  which  is  a  hybrid  between  planar  distance 
graphs  and  k-SIGs.  We  begin  by  creating  a  k-SIG  with  an  overly  large  value  of 
k.  However,  we  label  the  edges  with  the  planar  distance  d.  Next  we  find  the  local 
image  scale  factor  by  finding  the  best  constant  of  proportionality  between  d  and  k 
in  each  image  region.  Divide  all  d’s  by  this  coefficient  to  turn  them  into  less  noisy, 
noninteger  versions  of  k,  and  then  let  this  scaled  version  of  d  be  the  importance 
rating  for  an  edge.  Then  proceed  to  a  graph  and  a  match  metric  as  in  the  planar 
distance example. For this hybrid representation minutiae deletion only affects d
via the local scaling factor, which is determined by many different values of k and
is therefore fairly robust, preserving scale invariance.
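The scaling step of the hybrid representation can be sketched as follows. The text does not say how the "best" constant of proportionality is chosen; the sketch assumes a least-squares fit of d = c k through the origin within one image region, and the function names and sample (d, k) pairs are hypothetical.

```python
def local_scale(edges):
    """Least-squares constant c in d ~ c * k for one image region.

    edges: list of (d, k) pairs, d the planar distance, k the k-SIG label.
    """
    num = sum(d * k for d, k in edges)
    den = sum(k * k for _, k in edges)
    return num / den

def importance_ratings(edges):
    """Scaled distances d / c: noninteger, less noisy stand-ins for k."""
    c = local_scale(edges)
    return [d / c for d, _ in edges]

# Illustrative region: distances roughly proportional to k (assumed data).
region = [(2.1, 1), (3.9, 2), (6.2, 3), (8.0, 4)]
c = local_scale(region)
ratings = importance_ratings(region)
```

Since c is fitted from many edges at once, deleting one minutia shifts it only slightly, which is the robustness argument made above.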

3  Graph-Matching  Neural  Network  Implementation

Graph representations of minutiae maps provide only the first step in developing a
matching scheme for fingerprints. The second part of this research effort is devoted
to the design of graph matching algorithms and their implementation as neural
networks. As outlined in prior sections these algorithms are based on proximity
graphs which only contain edges for nearby minutiae. These edges are then labelled
with features and with an “edge importance rating” that directly affects
the weight given to that edge in the match metric. In this scenario one can construct
a  nested  series  of  graphs  including  more  and  more  edges  of  less  and  less  importance 
or  greater  and  greater  distance,  until  either  some  cut  off  limit  is  reached  or  all 
possible  edges  between  minutiae  in  the  map  are  included  in  the  largest  graph.  In 
this  class  of  algorithms  the  match  metric  between  two  such  graphs  is  insensitive  to 
arbitrary  reorderings  of  the  minutiae  in  either  graph,  and  is  implementable  as  an 
analog  neural  network.  Finally,  the  match  metric  will  contain  adjustable  weighting 
parameters  which  determine  the  relative  weights  of  matching  edges  as  a  function 
of  their  importance  rating,  and  the  relative  weights  of  any  other  labelling  infor¬ 
mation  attached  to  the  vertices  or  edges  of  the  graphs.  These  parameters  are  then 
determined  statistically  by  a  steepest-descent  training  algorithm  applied  to  a  data 
base  of  minutiae  maps,  followed  by  testing-set  validations,  exactly  as  is  done  in  the 
conventional  neural  network  paradigm. 

3.1  Relaxation  Networks  for  Graph  Matching  —  Deterministic 
Annealing  and  Lagrangian  Decomposition 

The  neural  network  implementation  of  this  research  effort  has  been  conducted  at 
Yale  University’s  Center  for  Theoretical  and  Applied  Neural  Science  in  cooperative 
efforts  with  E.  Mjolsness  and  A.  Rangarajan.  This  implementation  has  concen¬ 
trated  on  the  use  of  relaxation  neural  networks  for  matching  the  labelled  graphs 
corresponding  to  minutiae  maps.  It  should  be  noted  that  neural  network  approaches 
to  graph  matching  share  the  feature  of  other  recent  approaches  to  graph  matching 
in  that  they  are  not  restricted  to  looking  for  isomorphisms,  but  seek  to  minimize 
the  distance  between  two  graphs.  Graph  isomorphism  occurs  when  the  distance  is 
zero.  The  approach  at  Yale  utilizes  deterministic  annealing  to  generalize  to  inexact, 
weighted  graph  matching. 

In [11] Mjolsness and Rangarajan formulate the problem as follows: given graphs
G and g, find a permutation matrix M that brings the two sets of vertices into
correspondence. A permutation matrix is a zero-one matrix whose rows and columns
each sum to one. Within the framework of deterministic annealing one can easily
formulate the permutation matrix constraints. In this setting the row or column
constraints are winner-take-alls, and either set (but not both sets) of constraints
separately can be exactly imposed using deterministic annealing methods.
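The objects in this formulation can be illustrated on a toy problem. The sketch below is not the relaxation network of [11]: it checks the permutation-matrix constraints directly and finds the best match by exhaustive search, which is feasible only for very small graphs. The squared-difference distance, the function names and the two 3-node graphs are assumptions made for the example.

```python
from itertools import permutations

def is_permutation_matrix(m):
    """Zero-one square matrix whose rows and columns each sum to one."""
    n = len(m)
    entries_ok = all(v in (0, 1) for row in m for v in row)
    rows_ok = all(sum(row) == 1 for row in m)
    cols_ok = all(sum(m[i][j] for i in range(n)) == 1 for j in range(n))
    return entries_ok and rows_ok and cols_ok

def match_distance(G, g, perm):
    """Sum of squared differences between G[a][b] and g[perm[a]][perm[b]]."""
    n = len(G)
    return sum((G[a][b] - g[perm[a]][perm[b]]) ** 2
               for a in range(n) for b in range(n))

def best_match(G, g):
    """Brute-force minimiser over all vertex correspondences (tiny graphs only)."""
    return min(permutations(range(len(G))),
               key=lambda p: match_distance(G, g, p))

# A 3-node path with centre 1, and the same path relabelled with centre 0.
G = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
g = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
best = best_match(G, g)
M = [[1 if best[a] == b else 0 for b in range(3)] for a in range(3)]
```

A distance of zero for the best correspondence signals an isomorphism, while a positive minimum measures how inexact the match is.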

This  deterministic  annealing  approach  is  similar  to  a  Lagrangian  decomposition 
approach  in  that  the  row  and  column  constraints  are  satisfied  separately.  Lagrange 
multipliers  are  then  used  to  equate  the  two  solutions.  Other  methods  do  exist  to 
satisfy  both  the  row  and  column  constraints  and  this  method  (for  obtaining  a 
permutation  matrix)  is  equivalent  w.r.t.  fixpoints  to  the  one  presented  in  [12].  A 
fixpoint  preserving  transformation  [13]  is  applied  to  the  graph  matching  distance 
resulting  in  the  ability  to  express  the  combination  of  the  graph  matching  distance 
constraint  and  the  permutation  matrix  constraint  using  terms  that  are  linear  in  the 
permutation  matrix  M. 

Due to the existence of unavoidable symmetries in graph matching and the resulting
global minima, a symmetry-breaking term is added in order to always obtain
a permutation matrix. The symmetry-breaking term is similar to the hysteretic
annealing performed in [14,15] and is suitable for the deterministic approach being
used. The symmetry-breaking term is reversed via another fixpoint preserving
transformation. The network then performs minimization with respect to the
Lagrange parameters and maximization with respect to the permutation matrix. In
[11] simulation results are shown for the isomorphism problem with 100-node random,
undirected graphs and for the weighted graph matching problem with 100-node
random graphs with uniform noise added to the connections.

In this paper differential geometric control theory is used to define the key concepts of relative
degree and zero dynamics for a Dynamic Recurrent Neural Network (DRNN). It is shown that the
relative degree is the lower bound for the number of neurons and the zero dynamics are responsible
for the approximating capabilities of the network.

1  Introduction 

Most of the current applications of neural networks to control nonlinear systems
rely on the classical NARMA approach [1,2]. This procedure, powerful in itself, has
some drawbacks [3]. On the other hand, a DRNN is described by a set of nonlinear
differential equations and can be analysed using differential geometric techniques
[4].

In  this  work,  the  important  concepts  of  zero  dynamics  and  relative  degree  from  the 
differential  geometric  control  theory  are  formulated  for  a  control  affine  DRNN. 

2  Mathematical  Preliminaries 

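Consider the nonlinear control affine system

\[
\begin{aligned}
\dot{x}_1 &= x_2 \\
\dot{x}_2 &= x_3 \\
&\;\;\vdots \\
\dot{x}_{r-1} &= x_r \\
\dot{x}_r &= f_r(x_1, \ldots, x_r, x_{r+1}, \ldots, x_n) + g_r \cdot u \\
\dot{x}_{r+1} &= f_{r+1}(x_1, \ldots, x_r, x_{r+1}, \ldots, x_n) + g_{r+1} \cdot u \\
&\;\;\vdots \\
\dot{x}_n &= f_n(x_1, \ldots, x_r, x_{r+1}, \ldots, x_n) + g_n \cdot u \\
y &= x_1
\end{aligned}
\tag{1}
\]

These equations can be written in compact form

\[
\dot{x} = f(x) + g \cdot u, \qquad y = h(x) \tag{2}
\]

where $x \in \mathbb{R}^n$, $u \in \mathbb{R}$, $y \in \mathbb{R}$, $f(x)$ and $g$ are vector fields, and $h(x)$ is a scalar field. For
the system (1) there are two key concepts in the differential geometric framework,
namely the zero dynamics and the relative degree.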

2.1  Zero  Dynamics 

The zero dynamics of the system (1) describe its behaviour when the output $y(t)$
is forced to be zero [4]. With the output zero, the initial state of the system must
be set to a value such that $(x_1(0), \ldots, x_r(0))$ are zero, whereas $(x_{r+1}(0), \ldots, x_n(0))$
can be chosen arbitrarily. In addition, the input $u(t)$ must be the solution of the
equation

\[
0 = f_r(0, \ldots, 0, x_{r+1}(t), \ldots, x_n(t)) + g_r \cdot u \tag{3}
\]

Solving for $u(t)$ in (3) and replacing in the remaining equations, the zero dynamics
are given by the set of differential equations

\[
\begin{aligned}
\dot{x}_{r+1} &= f_{r+1}(0, \ldots, 0, x_{r+1}, \ldots, x_n) - \frac{g_{r+1}}{g_r}\, f_r(0, \ldots, 0, x_{r+1}, \ldots, x_n) \\
&\;\;\vdots \\
\dot{x}_n &= f_n(0, \ldots, 0, x_{r+1}, \ldots, x_n) - \frac{g_n}{g_r}\, f_r(0, \ldots, 0, x_{r+1}, \ldots, x_n)
\end{aligned}
\tag{4}
\]

The  zero  dynamics  play  a  role  similar  to  that  of  the  zeros  of  the  transfer  function 
in  a  linear  system.  If  the  zero  dynamics  are  stable  then  the  system  (1)  is  said  to  be 
minimum  phase. 


2.2  Relative  Degree 

The relative degree of a dynamical system is defined as the number of times that
the output $y(t)$ must be differentiated with respect to time in order to have the
input $u(t)$ appearing explicitly; equivalently, it is the number of state equations
that the input $u(t)$ must go through in order to reach the output $y(t)$.

The nonlinear system (2) is said to have relative degree $r$ [3,4] if

\[
L_g L_f^{i} h(x) = 0, \quad i = 0, \ldots, r-2
\]

\[
L_g L_f^{r-1} h(x) \neq 0
\]
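As a small worked illustration of this definition (an example chosen here, not taken from the paper), consider the double integrator $\dot{x}_1 = x_2$, $\dot{x}_2 = u$, $y = x_1$, which has the form (2) with $f(x) = (x_2, 0)^T$, $g = (0, 1)^T$ and $h(x) = x_1$:

```latex
\begin{aligned}
L_g h(x) &= \frac{\partial h}{\partial x}\, g = (1 \;\; 0)\,(0 \;\; 1)^T = 0, \\
L_f h(x) &= \frac{\partial h}{\partial x}\, f(x) = x_2, \\
L_g L_f h(x) &= \frac{\partial (L_f h)}{\partial x}\, g = (0 \;\; 1)\,(0 \;\; 1)^T = 1 \neq 0 .
\end{aligned}
```

The first condition holds for $i = 0$ and the second for $r - 1 = 1$, so the relative degree is $r = 2$. This agrees with the informal description: $\dot{y} = x_2$ does not contain $u$, while $\ddot{y} = u$ does.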

3  DRNN 

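A DRNN is described by the set of nonlinear differential equations

\[
\begin{aligned}
\dot{x}_i &= -x_i + \sum_{j=1}^{N} \omega_{ij}\, \sigma(x_j) + \gamma_i \cdot u, \qquad i = 1, \ldots, N \\
y &= x_1
\end{aligned}
\tag{5}
\]

or in matrix form

\[
\begin{aligned}
\dot{x} &= -x + W \cdot \Sigma(x) + \Gamma \cdot u \\
y &= x_1
\end{aligned}
\tag{6}
\]

where $x \in \mathbb{R}^N$, $W \in \mathbb{R}^{N \times N}$, $\Gamma \in \mathbb{R}^{N}$ and $\Sigma(x) = (\sigma(x_1), \ldots, \sigma(x_N))^T$.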

For control purposes it was demonstrated [3] that the network (6) can approximate
nonlinear systems of the class (2); the resulting model can then be analysed using the
differential geometric framework. The aim of this paper is to propose a canonical
structure for the network (6) in order to get any desired relative degree with
zero dynamics similar to (4).


Delgado et al.: Zero Dynamics and Relative Degree

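Theorem 1 The DRNN (6) with the following matrices $W$ and $\Gamma$ can have any
relative degree $r \in [2, N]$:

\[
\omega_{ij} = 0, \quad i = 1, \ldots, r-1, \quad j = i+2, \ldots, N; \qquad \gamma_i = 0, \quad i = 1, \ldots, r-1,
\]

in other terms, each of the first $r - 1$ rows of $W$ is zero beyond its superdiagonal
entry, and the first $r - 1$ entries of $\Gamma$ are zero.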

This particular structure is called the staircase structure. Note that the minimum
number of neurons is the relative degree $r$. For a desired relative degree $r = 1$, the
coefficient $\gamma_1$ must be nonzero.

Proof Applying the definition of relative degree

\[
\begin{aligned}
y &= x_1 \\
\dot{y} &= \dot{x}_1 = -x_1 + \omega_{11} \cdot \sigma(x_1) + \omega_{12} \cdot \sigma(x_2)
\end{aligned}
\]

After the first derivative, every new derivative of $y(t)$ introduces a new state equation
(staircase structure). Then after $r$ derivatives of the output $y(t)$, the input $u(t)$
appears explicitly. □

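Theorem 2 The network of Theorem 1 with a sigmoid function satisfying $\sigma(0) = 0$ has
zero dynamics described by the set of differential equations

\[
\dot{x}_i = -x_i + \sum_{j=r+1}^{N} \left( \omega_{ij} - \frac{\gamma_i}{\gamma_r}\, \omega_{rj} \right) \cdot \sigma(x_j), \qquad i = r+1, \ldots, N
\]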

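Proof The zero dynamics are defined as the resulting dynamics when the output
$y(t)$ is constrained to be zero. In the structure proposed $y(t) = 0$ means that the
first $r$ state variables are zero, $x_1(t) = \cdots = x_r(t) = 0$, and thus $\sigma(x_1(t)) = \cdots = \sigma(x_r(t)) = 0$.
The equation for $x_r(t)$ becomes

\[
0 = \sum_{j=r+1}^{N} \omega_{rj} \cdot \sigma(x_j) + \gamma_r \cdot u
\]

solving for $u$

\[
u = -\frac{1}{\gamma_r} \sum_{j=r+1}^{N} \omega_{rj} \cdot \sigma(x_j)
\]

replacing the control $u$ in the last $N - r$ equations

\[
\dot{x}_i = -x_i + \sum_{j=r+1}^{N} \omega_{ij} \cdot \sigma(x_j) - \frac{\gamma_i}{\gamma_r} \sum_{j=r+1}^{N} \omega_{rj} \cdot \sigma(x_j), \qquad i = r+1, \ldots, N
\]

Finally

\[
\dot{x}_i = -x_i + \sum_{j=r+1}^{N} \left( \omega_{ij} - \frac{\gamma_i}{\gamma_r}\, \omega_{rj} \right) \cdot \sigma(x_j), \qquad i = r+1, \ldots, N \qquad \Box
\]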

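Example: Consider the following staircase DRNN

\[
\begin{aligned}
\dot{x}_1 &= -x_1 + \omega_{11} \cdot \sigma(x_1) + \omega_{12} \cdot \sigma(x_2) \\
\dot{x}_2 &= -x_2 + \omega_{21} \cdot \sigma(x_1) + \omega_{22} \cdot \sigma(x_2) + \omega_{23} \cdot \sigma(x_3) + \gamma_2 \cdot u \\
\dot{x}_3 &= -x_3 + \omega_{31} \cdot \sigma(x_1) + \omega_{32} \cdot \sigma(x_2) + \omega_{33} \cdot \sigma(x_3) + \gamma_3 \cdot u
\end{aligned}
\]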


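Relative Degree: The first derivative of the output is

\[
\dot{y} = \dot{x}_1 = -x_1 + \omega_{11} \cdot \sigma(x_1) + \omega_{12} \cdot \sigma(x_2)
\]

the input $u$ does not appear so the relative degree is greater than one. The second
derivative of the output is

\[
\ddot{y} = -\dot{x}_1 + \omega_{11} \cdot \sigma'(x_1) \cdot \dot{x}_1 + \omega_{12} \cdot \sigma'(x_2) \cdot \dot{x}_2
\]

where $\sigma'(x) = d\sigma(x)/dx$. Replacing $\dot{x}_1$ and $\dot{x}_2$

\[
\begin{aligned}
\ddot{y} = {} & (-1 + \omega_{11} \cdot \sigma'(x_1)) (-x_1 + \omega_{11} \cdot \sigma(x_1) + \omega_{12} \cdot \sigma(x_2)) \\
& + \omega_{12} \cdot \sigma'(x_2) \cdot (-x_2 + \omega_{21} \cdot \sigma(x_1) + \omega_{22} \cdot \sigma(x_2) + \omega_{23} \cdot \sigma(x_3) + \gamma_2 \cdot u)
\end{aligned}
\]

notice that the input appears explicitly so the relative degree is $r = 2$.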

Zero Dynamics: The proposed network has three states and relative degree two, so
the zero dynamics has one state.

Following the definition of zero dynamics [4], $y = x_1 = 0$. The first state equation
is reduced to

\[
0 = -0 + \omega_{11} \cdot \sigma(0) + \omega_{12} \cdot \sigma(x_2)
\]

this yields $x_2 = 0$. The second state equation is

\[
0 = -0 + \omega_{21} \cdot \sigma(0) + \omega_{22} \cdot \sigma(0) + \omega_{23} \cdot \sigma(x_3) + \gamma_2 \cdot u
\]

solving for the input

\[
u = -\frac{\omega_{23}}{\gamma_2}\, \sigma(x_3)
\]

replacing $u$ in the remaining state equation, the zero dynamics is

\[
\dot{x}_3 = -x_3 + \left( \omega_{33} - \frac{\gamma_3}{\gamma_2}\, \omega_{23} \right) \cdot \sigma(x_3)
\]
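The zero dynamics of this three-neuron example can be checked numerically. The following is a sketch under assumptions: the weights and input gains are arbitrary illustrative values arranged in the staircase pattern, tanh is used as a sigmoid with value zero at the origin, and the function names are hypothetical.

```python
import math

def drnn_rhs(x, u, W, gamma, sigma=math.tanh):
    """Right-hand side of the DRNN: dx_i/dt = -x_i + sum_j W[i][j]*sigma(x_j) + gamma[i]*u."""
    n = len(x)
    return [-x[i] + sum(W[i][j] * sigma(x[j]) for j in range(n)) + gamma[i] * u
            for i in range(n)]

# Illustrative staircase weights for relative degree 2: W[0][2] = 0, gamma[0] = 0.
W = [[0.5, 1.0, 0.0],
     [0.3, -0.2, 0.8],
     [0.1, 0.4, -0.6]]
gamma = [0.0, 2.0, 1.0]

def zero_dynamics_rhs(x3):
    """dx_3/dt = -x_3 + (w33 - (gamma_3 / gamma_2) * w23) * sigma(x_3)."""
    return -x3 + (W[2][2] - gamma[2] / gamma[1] * W[1][2]) * math.tanh(x3)

x3 = 0.7
u = -W[1][2] / gamma[1] * math.tanh(x3)      # input that keeps the output at zero
dx = drnn_rhs([0.0, 0.0, x3], u, W, gamma)   # state on the zero-output manifold
```

On the manifold $x_1 = x_2 = 0$ the first two components of `dx` vanish, so the state stays there, and the third component coincides with `zero_dynamics_rhs(x3)`.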

In the simulations [5, 6] a single link manipulator of relative degree 2 and without
zero dynamics was identified with a DRNN. The resulting model has the same
relative degree, but it needs the zero dynamics in order to improve the plant
approximation.

4  Conclusions

A DRNN is described naturally by a set of nonlinear differential equations and
can be analysed within the framework of differential geometric control theory.
There are two key concepts in this framework: the zero dynamics and the relative
degree.

Two theorems were formulated and proved. As a result, a particular structure is
proposed for a DRNN in order to get any desired relative degree $r \in [1, N]$ and
the canonical zero dynamics (4). The relative degree is a lower bound for the
number of neurons and the zero dynamics are responsible for the approximating
capabilities of the neural network. That is, for multilayer networks the
approximating capabilities reside in the hidden layers and for recurrent networks
the approximating capabilities reside in the zero dynamics.

The use of feedforward neural networks (FNNs) for non-linear control based on the input-output
discrete-time description of systems is presented. The discussion focusses on coupling of reconstruction
techniques for n-D irregularly sampled space- and/or band-limited functions with feedforward
neural networks. The essence of the approach is to use the multi-dimensional irregular sampling
theory to obtain the neural network representation of an interpolating filter as well as bounds for
the accuracy of the interpolation from the finite set. Key questions from the relations between band-
and space-limited functions are addressed and their consequences for neurocontrol are underlined.

1  Introduction

This paper focusses on the discrete-time input-output model of a deterministic
non-linear single-input single-output (SISO) system

\[
y(k+1) = f(y(k), \ldots, y(k-n+1), u(k), \ldots, u(k-m+1)) \tag{1}
\]

with output $y \in [a, b] \subset \mathbb{R}$, input $u \in [c, d] \subset \mathbb{R}$ and $f: D \to [a, b]$ with the
domain of definition $D = [a, b]^n \times [c, d]^m \subset \mathbb{R}^{n+m}$. This model is referred to as the
NARMA (Non-linear Auto-Regressive Moving Average) model (see [1]) and is usually
obtained by discretisation of a deterministic non-linear (Lipschitz) continuous-time
SISO system described by controlled ordinary differential equations. NARMA
models are valid only locally, and with this caveat they may be a starting point for
the considerations of non-linear dynamic systems modelling in the context of
neurocontrol (see [6]).

2  Generation  of  Irregular  Samples  by  a  Dynamic  System 

In practice $f$ in (1) is often unknown, and modelling has to be based on the given
pairs of multi-dimensional samples $((y_k, \ldots, y_{k-n+1}, u_k, \ldots, u_{k-m+1}), y_{k+1})$, where
we put $y_k = y(k)$ etc. for brevity. Given the samples

\[
\xi_k = (y_k, \ldots, y_{k-n+1}, u_k, \ldots, u_{k-m+1})
\]

and $f(\xi_k)$, the issue is to reconstruct the multi-variable function $f$, a problem from
multi-dimensional signal processing (note that it is completely separate from the
question of band-limiting of $y$). The approach was introduced by Sanner & Slotine
[3], but they assumed that the multi-dimensional samples are uniform, i.e., regularly
distributed in the domain $D$ of $f$. This seems to be a simplification, as the dynamics
of (1) manifest themselves through irregular samples. For example, if $f$ is linear, i.e.,
$y_{k+1} = a_0 y_k + \ldots + a_{n-1} y_{k-n+1} + b_0 u_k + \ldots + b_{m-1} u_{k-m+1}$, then even for constant
input the output will not take values in constant increments, but according to the
slope of the hyperplane determined by $f$. Thus even uniformity of $u$ cannot, in
general, ensure regular distribution of values of $y$, because the irregularity of the
distribution represents $f$.
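The linear case can be made concrete with a first-order instance, $y_{k+1} = a_0 y_k + b_0 u_k$, driven by a constant input. The coefficients, initial value and function name below are assumptions chosen so that the arithmetic is exact; they are not taken from the paper.

```python
def simulate_linear(a0, b0, u, y0, steps):
    """Iterate y_{k+1} = a0 * y_k + b0 * u for a constant input u."""
    ys = [y0]
    for _ in range(steps):
        ys.append(a0 * ys[-1] + b0 * u)
    return ys

ys = simulate_linear(a0=0.5, b0=1.0, u=1.0, y0=0.0, steps=5)
increments = [b - a for a, b in zip(ys, ys[1:])]
# ys         -> [0.0, 1.0, 1.5, 1.75, 1.875, 1.9375]
# increments -> [1.0, 0.5, 0.25, 0.125, 0.0625]
```

Although the input is constant, successive outputs are spaced by shrinking increments, so the sampled values of $y$ are irregular even for this simplest choice of $f$.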

We now examine in detail the nature of this process in low dimensions (as this can
be illustrated graphically). We look at the way nonuniform samples, i.e., $\xi_k$ in the
pairs $(\xi_k, f(\xi_k))$, are generated when $f$ is the right-hand side (RHS) of a dynamic
system. For simplicity, instead of dealing with a controlled system of type (1), we
concentrate on the low-order autonomous case

\[
y_{k+1} = f(y_k, y_{k-1}), \tag{2}
\]

with $y \in [0, 1]$. Thus $\xi_k = (y_k, y_{k-1})$. We also assume that $f$ is continuous; its
domain, $D = [0, 1]^2$ for (2), is compact (and connected).

Figure 1 Iterative map $f: [0,1] \times [0,1] \to [0,1]$ defined by $y_{k+1} = f(y_k, y_{k-1})$.

The essential observation is that $y_k$, i.e., $\xi_k$ of the pairs $(\xi_k, f(\xi_k))$, occur in the $x_1 x_2$
plane in a nonuniform (irregular) way. They will be, in general, unevenly spaced
and their pattern of appearance will depend on the dynamics of (2), or the shape
of $f: [0,1] \times [0,1] \to [0,1]$.

The sample points $\xi_k = (y_k, y_{k-1})$ appear in the $x_1 x_2$ plane according to the
iterative process (2); see Fig. 1. If we start with $\xi_1 = (y_1, y_0)$, where $y_1 \in 0x_2$ and
$y_0 \in 0x_1$, then we can read out $y_2$ from the surface representing $f$. Then $y_2$ is
reflected through $x_3 = x_2$ on the $x_2 x_3$ plane, becoming a point on the $0x_2$ axis. At
the same time $y_1$ is reflected through $x_2 = x_1$ on the $x_1 x_2$ plane, becoming a point
on the $0x_1$ axis. This results in the point $(y_1, y_2)$ in the $x_1 x_2$ plane, corresponding
to the sample $\xi_2 = (y_2, y_1)$. We can now read out $y_3$ from the surface representing
$f$ and repeat the process for $k = 3$. Now $y_3$ ‘migrates’ from $0x_3$ to $0x_2$ and $y_2$
from $0x_2$ to $0x_1$, generating the point $(y_2, y_3)$ on the $x_1 x_2$ plane corresponding to
the sample $\xi_3 = (y_3, y_2)$, etc.
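The iteration described above can be sketched in a few lines. The particular map f below is an assumption chosen only because it is continuous and maps the unit square into [0, 1]; the function names and initial values are likewise illustrative.

```python
def generate_samples(f, y0, y1, count):
    """Return pairs (xi_k, f(xi_k)) with xi_k = (y_k, y_{k-1}), starting from xi_1."""
    prev, cur = y0, y1
    samples = []
    for _ in range(count):
        xi = (cur, prev)          # sample point in the x1-x2 plane
        nxt = f(cur, prev)        # value read off the surface representing f
        samples.append((xi, nxt))
        prev, cur = cur, nxt      # y_k becomes y_{k-1}, y_{k+1} becomes y_k
    return samples

def f(a, b):
    """Illustrative continuous map from [0,1] x [0,1] into [0.1, 1.0]."""
    return 0.9 * a * (1.0 - b) + 0.1

samples = generate_samples(f, y0=0.2, y1=0.7, count=6)
points = [xi for xi, _ in samples]
```

The first coordinate of each new sample point is the value just read from the surface, and its second coordinate is the previous first coordinate, which is the 'migration' between axes described in the text. The locations of the points are dictated by f rather than by any regular grid.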

Thus, we have to address the issue of nonuniform sampling and to provide at least
the existence conditions for function recovery with any degree of accuracy from a
sufficiently large finite number of irregular samples, thereby making the application
of neural networks plausible.

3  Band-limited and Space-limited Functions

The crucial question we are dealing with is whether the non-linear functions
describing dynamical systems as NARMA models (1) are strictly band-limited or
space-limited in the sense usually used in Signal Processing. Also important are
the consequences of these properties for the reconstruction of such functions. The
related question is how their extensions to functions on the whole domain are
constructed. This is important, because we are going to use the tools of harmonic
analysis, especially the Fourier Transform or Fourier Series. Thus we extend $f$ to
be equal to 0 outside its natural domain $D = [a, b]^n \times [c, d]^m$. For simplicity, the
subsequent considerations concern mainly the 1-D case.

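We shall say that a function $f(x)$ is band-limited if its Fourier transform (multi-dimensional)
is zero outside a finite region, and its energy is finite, i.e.,

\[
F(\omega) = 0 \ \text{ for } \ |\omega| > \Omega \quad \text{and} \quad E = \frac{1}{2\pi} \int_{-\infty}^{\infty} |F(\omega)|^2 \, d\omega < \infty,
\]

$\omega$ and $\Omega$ being vectors, in general. Analogously we shall say that a function is
space-limited in the multi-dimensional case if

\[
f(x) = 0 \ \text{ for } \ |x| > X \quad \text{and} \quad E < \infty.
\]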

The problem is: are the ‘usual’ functions encountered in dynamic systems description
really band-limited or space-limited? From the theory of complex functions
we know that any function that is band-limited is analytic in its domain, i.e., is
an entire function. On the other hand, an entire function cannot vanish on an
interval, except for the case $f(x) = 0$; thus it is not space-limited. So $f$ cannot be
band-limited and space-limited at the same time. This is a manifestation of the
uncertainty principle of harmonic analysis. For if we want to localise a function
$f(x)$ in its (spatial) domain, it must be composed of very many $\sin \omega x$ and thus
$\Delta\omega$ must be large, and vice versa. Generally, the uncertainty principle of harmonic
analysis says: It is impossible for a non-zero function and its Fourier transform to
be simultaneously very small.


4  Concentration  Problem  and  Prolate  Spheroidal  Wave 
Functions 

Knowing that a non-trivial band-limited function cannot be space-limited we naturally
come to the question: what kind of approximately space-limited function
corresponds to a given band-limited function? In practical terms we are looking
for a function whose ‘energy’ outside the given spatial region is small enough to be
indistinguishable from the energy of a (strictly) space-limited function. Therefore, we
may define a meaningful measure of concentration of a function as

\[
\alpha^2(X) = \frac{\int_{-X/2}^{X/2} |f(x)|^2 \, dx}{\int_{-\infty}^{\infty} |f(x)|^2 \, dx}.
\]

We want to determine how large $\alpha^2(X)$ can be for a given band-limited $f$; dually
one may define an appropriate measure of concentration of the amplitude spectrum
of $f(x)$. Note that if $f(x)$ were indeed space-limited to $(-X/2, X/2)$,
then $\alpha^2(X)$ would have its largest value, namely unity. To solve the problem of
maximising $\alpha^2(X)$ we have to express $f(x)$ in terms of its amplitude spectrum
$F(\omega)$ and find the $F(\omega)$ for which $\alpha^2(X)$ achieves maximal value. This is a classical
problem of mathematical physics (see [2]) and we know that the maximising $F(\omega)$
must satisfy the homogeneous Fredholm integral equation of the second kind

\[
\int_{-\Omega}^{\Omega} \frac{\sin \pi X (\omega' - \omega'')}{\pi (\omega' - \omega'')} \, F(\omega'') \, d\omega'' = \alpha^2(X) \, F(\omega'), \qquad |\omega'| \le \Omega. \tag{3}
\]

The solutions to equation (3) are known as prolate spheroidal wave functions (pswf)
and they provide a useful set of band-limited functions (see [5] for more details).
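A numerical feel for the concentration measure can be obtained with a simple band-limited test function. The sketch below assumes the normalised function sin(πx)/(πx), whose spectrum is confined to a finite band and whose total energy is exactly one; the function names, the midpoint rule and the chosen interval lengths are illustrative assumptions, and no attempt is made to compute the maximising prolate spheroidal functions.

```python
import math

def sinc(x):
    """Normalised sinc, sin(pi x) / (pi x), a band-limited function."""
    return 1.0 if x == 0.0 else math.sin(math.pi * x) / (math.pi * x)

def concentration(X, steps=20000):
    """Fraction of the energy of sinc lying in (-X/2, X/2), by the midpoint rule."""
    h = X / steps
    inside = sum(sinc(-X / 2 + (i + 0.5) * h) ** 2 for i in range(steps)) * h
    total = 1.0  # integral of sinc(x)**2 over the whole real line
    return inside / total

alpha2 = {X: concentration(X) for X in (1.0, 2.0, 8.0)}
```

The computed fractions grow with X but stay strictly below one, reflecting the fact that a band-limited function always leaves some energy outside any finite interval. For X = 2 (the main lobe) the fraction is roughly 0.90.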

5  The $2X\Omega$ Theorem

Our principal question was how well the band-limited and space-limited functions
(with the above-mentioned restrictions) are suited to model real-world dynamic
automatic control systems. To shed more light on it let us recall the so-called $2X\Omega$
Theorem [4]. Its practical engineering formulation says that if $X\Omega$ is large enough,
then the space of functions of space range $X$ and ‘bandwidth’ $\Omega$ has dimension
$[2X\Omega]$. To formulate it in a more rigorous manner we have to introduce the notion
of space-limited and band-limited functions in a way avoiding their dependence on
the detailed behaviour of functions or their Fourier transforms at infinity.

So, we say that a function $f(x)$ is space-limited in the multi-dimensional case to the
interval $(-X/2, X/2)$ at level $\varepsilon$ if

\[
\int_{|x| > X/2} |f(x)|^2 \, dx < \varepsilon,
\]

i.e., if the energy outside this space region is less than what is, in some sense, essential
for us. In the same way we say that a function is band-limited with bandwidth $\Omega$ at
level $\varepsilon$ if

\[
\int_{|\omega| > \Omega/2} |F(\omega)|^2 \, d\omega < \varepsilon,
\]

i.e., the energy outside the frequency range is less than the value we are interested
in. Using these newly defined notions we may state that every function is both
space-limited and band-limited at level $\varepsilon$ (for some $X$ and $\Omega$) as opposed to only
one function ($f(x) = 0$) which is both space-limited and band-limited in the strict
sense.

To complete the reformulation of the $2X\Omega$ Theorem we need one more definition.
We say that a set of functions $F$ has an approximate dimension $N$ at level $\epsilon$ on
the interval $(-X/2, X/2)$ if there exists a set of $N = N(X, \epsilon)$ functions
$\phi_1, \phi_2, \ldots, \phi_N$ such that for each $f(x) \in F$ there exist
$a_1, a_2, \ldots, a_N$ such that

$$\int_{-X/2}^{X/2} \Big[ f(x) - \sum_{j=1}^{N} a_j \phi_j(x) \Big]^2 dx < \epsilon \tag{4}$$

and there is no set of $N - 1$ functions that will approximate every $f(x) \in F$ in this
way. This definition says, in other words, that every function in $F$ can be approximated
in $-X/2 < x < X/2$ by a function in the linear span of $\phi_1, \phi_2, \ldots, \phi_N$,
so that the difference between the function and its approximation is less than $\epsilon$.

The restated version of the theorem now has the following form:

Theorem 1 Let $F_\epsilon$ be the set of functions space-limited to $(-X/2, X/2)$ at level
$\epsilon$ and band-limited to $(-\Omega, \Omega)$ at level $\epsilon$. Let
$N(\Omega, X, \epsilon, \epsilon')$ be the approximate dimension of $F_\epsilon$ at level
$\epsilon'$. Then for every $\epsilon' > \epsilon$

$$\lim_{X \to \infty} \frac{N(\Omega, X, \epsilon, \epsilon')}{X} = 2\Omega, \qquad \lim_{\Omega \to \infty} \frac{N(\Omega, X, \epsilon, \epsilon')}{\Omega} = 2X.$$

In fact these limits do not depend on $\epsilon$, and the set of functions which in the real
world we must consider to be limited both in space and frequency will always be
asymptotically $2X\Omega$-dimensional.
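The dimension count can be illustrated numerically. The following is only a sketch under stated assumptions: the kernel of equation (3) is discretised with a midpoint rule on $(-\Omega, \Omega)$, and an eigenvalue is called "significant" when it exceeds 0.5. The function name, the grid size and the threshold are illustrative choices, not part of the original text.

```python
import numpy as np

def count_significant_eigenvalues(X, Omega, n=400, threshold=0.5):
    """Count eigenvalues above `threshold` of the discretised kernel of eq. (3)."""
    h = 2.0 * Omega / n                      # width of one quadrature cell
    w = -Omega + h * (np.arange(n) + 0.5)    # cell midpoints on (-Omega, Omega)
    diff = w[:, None] - w[None, :]
    # np.sinc(u) = sin(pi u) / (pi u), hence X * sinc(X d) = sin(pi X d) / (pi d)
    kernel = X * np.sinc(X * diff)
    eigenvalues = np.linalg.eigvalsh(h * kernel)
    return int(np.sum(eigenvalues > threshold))

if __name__ == "__main__":
    for X, Omega in [(2.0, 1.0), (4.0, 2.0), (8.0, 2.0)]:
        # Print the count next to the predicted dimension 2 * X * Omega.
        print(X, Omega, count_significant_eigenvalues(X, Omega), 2 * X * Omega)
```

Under these assumptions the number of eigenvalues close to one should track $2X\Omega$, with the remaining eigenvalues falling off rapidly towards zero.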

6  Conclusions 

In this paper we have shown how irregular samples are generated by a dynamic
system. We argue that in general this is an intrinsic feature of the NARMA
model and that our attempts to reconstruct the function $f$ should be set in the
irregular sampling context. The most important question in this setting is the one
of space- and band-limitedness of the function under consideration. The result of
paramount importance from the point of view of function approximation by finite
linear combinations of functions is given in the form of the $2X\Omega$ theorem. Equation
(4) stipulates the existence of a finite approximation of a given nonlinear function
by a linear combination of functions. It also gives the lower bound for the number
of these functions. This allows the application of a neural network with a known (finite)
number of neurons.

On the other hand, the $2X\Omega$ theorem is also interesting from the point of view of
function reconstruction from irregularly spaced samples. In this case we have
to assume that the function to be reconstructed is band-limited. From practical
considerations it has to be space-limited as well. This problem is normally solved
in the context of the theory of complex functions analytic in the entire domain
(entire functions and their special types). Let us notice that when applying the
$2X\Omega$ Theorem, instead of using entire functions of exponential type (as is usually
the case in irregular sampling), which is quite restrictive, we deal with functions which
are square-integrable, a condition that is easily fulfilled in most practical cases.

An unsupervised learning principle is proposed for individual neurons with complex synaptic
structure and dynamical input. The learning goal is a neuronal response to temporal constancies:
if some input patterns often occur in close temporal succession, then the neuron should respond
either to all of them or to none. It is shown that linear threshold neurons can achieve this learning
goal if each synapse stores not only a weight, but also a short-term memory trace. The online
learning process requires no biologically implausible interactions. The sequence of temporal
associations can be interpreted as a random walk on the state transition graph of the input dynamics.
In numerical simulations the learning process turned out to be robust against parameter changes.

1  Introduction 

Many  neocortical  neurons  show  responses  that  are  invariant  to  changes  in  position, 
size,  illumination,  or  other  properties  of  their  preferred  stimuli  (see  review  [5]). 
These  temporal  “constancies”  or  “invariants”  do  not  have  to  be  inborn:  In  numerical 
simulations  [2,  3,  6]  neurons  could  learn  such  responses  by  associating  stimuli  that 
occur  in  close  temporal  succession.  Similar  temporal  associations  have  also  been 
observed  experimentally  [4]. 

In most numerical simulations [2, 6] temporal associations were formed with the help
of “memory traces”, which are running averages of the neuron’s recent activity: if
the stimulus $S_i$ is often followed by the stimulus $S_j$, a strong response of the neuron
to $S_i$ causes a large memory trace at the time of stimulus $S_j$, which teaches the
neuron to respond to $S_j$, too. By presenting the stimuli in reverse temporal order
$S_j \to S_i$, the response to stimulus $S_i$ is likewise strengthened by the response to
stimulus $S_j$.

This simple learning scheme no longer works if the input dynamics is irreversible,
that is, if only the transition $S_i \to S_j$ occurs. In the following a more
complex learning scheme will be presented, which can form associations in both
temporal directions of an irreversible input dynamics. For this purpose an additional
memory trace will have to be stored in each synapse. Numerical results are
presented for neuronal input generated by a Markov process which is simpler than
naturally occurring input, but irreversible and highly stochastic. A temporal constancy
in such a dynamics consists of a set of states which are closely connected
to each other by temporal transitions. The learned synaptic excitation can be derived
from the neuronal firing pattern by interpreting the sequence of temporal
associations as a random walk between Markov states.

2  Neuron  Model 

The model neuron was chosen to have the simple activation dynamics of a linear
threshold unit, but a complicated learning dynamics. The input signal is generated
by a temporally discrete, autonomous, stochastic dynamics with a finite number
$N$ of states $S_i$. Each state is connected with a sensory afferent. At any time $t$
the afferent of the present state $R(t) \in \{S_0, \ldots, S_N\}$ is set active ($x_j(t) = 1$ for
$R(t) = S_j$), while all the other afferents are set passive ($x_i(t) = 0$ for $R(t) \neq S_i$).
Each afferent forms one synapse with the model neuron. The sum of weighted
inputs is called the neuron’s activity $a(R(t)) = \sum_i w_i(t) x_i(t)$. If the activity exceeds
a threshold $\theta$, the neuron’s output $y(R(t))$ is 1, otherwise it is 0.

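The memory trace $\langle p \rangle_r$ of any quantity $p$ is defined as its exponentially
weighted past average at time $t$. The subscript $r$ indicates over which time scale
$1/\eta_r$ the average is done (with $0 < \eta_r < 1$).

$$\langle p \rangle_r(t) := \eta_r \sum_{\tau=1}^{\infty} (1 - \eta_r)^{\tau-1}\, p(t - \tau) \tag{1}$$

$$\Rightarrow\quad \langle p \rangle_r(t+1) - \langle p \rangle_r(t) = \eta_r \cdot \big(p(t) - \langle p \rangle_r(t)\big) \tag{2}$$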
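The equivalence of the explicit sum (1) and the online update (2) can be checked numerically. The sketch below is illustrative only: it assumes the normalised form $\eta_r \sum_{\tau \ge 1} (1-\eta_r)^{\tau-1} p(t-\tau)$ for (1), truncates the infinite sum at a finite horizon, and uses a hypothetical periodic test signal; none of the function names come from the original text.

```python
def trace_explicit(p, t, eta, horizon=2000):
    """Eq. (1), truncated: eta * sum_{tau=1}^{horizon} (1-eta)^(tau-1) * p(t-tau)."""
    return eta * sum((1.0 - eta) ** (tau - 1) * p(t - tau)
                     for tau in range(1, horizon + 1))

def trace_online(p, t_end, eta, start=-2000):
    """Eq. (2), iterated from a zero trace at time `start` up to time `t_end`."""
    trace = 0.0
    for t in range(start, t_end):
        # trace(t+1) - trace(t) = eta * (p(t) - trace(t))
        trace += eta * (p(t) - trace)
    return trace

def periodic_signal(t):
    """Hypothetical test quantity: active on every third time step."""
    return 1.0 if t % 3 == 0 else 0.0

if __name__ == "__main__":
    print(trace_explicit(periodic_signal, 50, 0.1))
    print(trace_online(periodic_signal, 50, 0.1))
```

Because the update at time $t$ only uses values of $p$ up to time $t$, the trace can be maintained online with a single stored number per quantity.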

The neuron model includes a variable threshold $\theta(t)$ and a variable synaptic decay
term $\rho(t)$, which serve to keep the neuron’s average output $y$ and activity $a$ near
the preset values $\bar{y}, \bar{a} > 0$.

$$\theta(t) := \langle \theta \rangle_\theta(t) \cdot \big(1 + \eta_y \cdot (\langle y \rangle_\theta(t) - \bar{y})\big) \tag{3}$$

p{t)  -.=  <p>p{t)  +  <a>^(f)  ■  +  '//>  ■  (<«>/>(<)  -  «) j  (4) 

(Remember that the memory traces $\langle \theta \rangle_\theta(t)$ and $\langle \rho \rangle_\rho(t)$ at time $t$ do not depend
on $\theta(t)$ or $\rho(t)$.) If the total synaptic change $\sum_i \Delta w_i$ is positive and if the activity
$a$ exceeds its target $\bar{a}$, $\rho$ will decrease on a time scale $1/\eta_\rho$, causing a subsequent
decrease of synaptic weights in eq. (6). The time scales $1/\eta_\theta$ and $1/\eta_\rho$ of the averages
are set large compared to the recurrence time of sensory input patterns. The rate
r)y  ^  l/Z  determines  how  much  the  threshold  0  is  increased,  if  the  average  output 
<y>e  exceeds  its  target  y. 

Synaptic  learning  processes  determine  the  synaptic  weights  Wi  and  the  synaptic 
short-term  memory  traces  qi,  which  measure  the  contribution  that  a  synapse  has 
recently  made  to  the  neuronal  activity  a. 


9i(i) 

(5) 

:=  •n^Wi{t)xi{t)  ^  p{t)^  -f  Vwafz{t)qi(t) 

(6) 

The decay term $\rho(t)$ was defined above. The past and future prefactors $\alpha_p$ and
$\alpha_f$ will be discussed below. The quantity $z(t)$ models the postsynaptic depolarization
that affects the learning process. It is defined as a sum $z(t) = a(t) + \gamma y(t)$ of the
depolarization $a$ caused by other synaptic inputs and the depolarization caused by
the firing $y$ of the postsynaptic neuron, weighted by a factor $\gamma > 0$.

This synaptic learning rule can be rewritten into a more comprehensible form by
defining modified synaptic changes $\Delta\tilde{w}_i$.

(00 

^np  Xj  -’■)•(!-  npP~^^ 

00  \ 

By inserting definitions (1) and (5) into eq. (6) one can easily show that the modified
synaptic changes $\Delta\tilde{w}_i$ cause approximately the same total changes in the long run:
$\sum_{t=1}^{T} \Delta\tilde{w}_i(t) \approx \sum_{t=1}^{T} \Delta w_i(t)$ for large $T$. The approximation is good if the
learning period $T$ is much larger than the duration $1/\eta_f$ of the synaptic memory trace
$q_i$. It is exact if the depolarization $z(t)$ is zero for $t < 1$ and $t > T$.

The definition (7) of $\Delta\tilde{w}_i$ resembles an unsupervised Widrow-Hoff rule,
in which the desired activity of the neuron depends on an exponentially weighted
average of past and future postsynaptic depolarizations $z(t \pm \tau)$. Because the future
depolarizations $z(t + \tau)$ are not yet known at time $t$, the memory traces $q_i$ had to
be introduced in order to construct the online learning rule (6) for $\Delta w_i(t)$.

3  Interpretation  of  Temporal  Associations  as  a  Random  Walk 

After a sufficiently long time $t_0$, any successful learning process should reach a
quasistationary state, in which synaptic changes cancel each other in the long run.
In the following we will derive the neural activities $a(S_i)$ in the quasistationary
state from the neural outputs $y(S_j)$ and the transition rules of the input dynamics.
Even in the quasistationary state, the quantities $w_i$, $\theta$, and $\rho$ will fluctuate
irregularly if the input is generated by a stochastic dynamics. We assume that the
learning rates $\eta_w$, $\eta_\theta$ and $\eta_\rho$ are so small that these fluctuations can be neglected.

Then the activity $a$, the output $y$, and the depolarization $z = a + \gamma y$ depend on time
$t$ only through the state $R(t)$ of the input dynamics. Under the assumption that the
input dynamics is ergodic, the temporal average $\sum_{t=1}^{T} \Delta w_i(t + t_0)/T$ can be replaced
by an average over all possible trajectories $\ldots \to R(-1) \to R(0) \to R(1) \to \ldots$
of the input dynamics. By definition, quasistationarity has been reached if these
average weight changes vanish for all synapses. Using definition (7) of $\Delta\tilde{w}_i$, the
condition of quasistationarity now reads:

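$$a(R(0)) = \frac{\alpha_f}{\rho} \sum_{R(1),R(2),\ldots} P_f(R(0) \to R(1) \to \ldots) \Big( \sum_{\tau=1}^{\infty} \eta_f (1-\eta_f)^{\tau-1}\, z(R(\tau)) \Big) + \frac{\alpha_p}{\rho} \sum_{\ldots,R(-2),R(-1)} P_p(\ldots \to R(-1) \to R(0)) \Big( \sum_{\tau=1}^{\infty} \eta_p (1-\eta_p)^{\tau-1}\, z(R(-\tau)) \Big) \tag{8}$$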

for any state $R(0)$. Here $P_f(R(0) \to R(1) \to \ldots)$ denotes the probability measure
that a trajectory starting at $R(0)$ will subsequently pass through $R(1)$, $R(2)$, ....
Analogously, $P_p(\ldots \to R(-1) \to R(0))$ denotes the relative probabilities of
trajectories ending at $R(0)$.

The right hand side of eq. (8) can be interpreted as an average over random jumps
between states of the input dynamics. Every jump starts at state $R(0)$. It jumps
with probability $\alpha_f/\rho$ into the future and with probability $\alpha_p/\rho$ into the past.
The jump length $\tau$ is exponentially distributed, with a mean of $1/\eta_f$ (or $1/\eta_p$)
time steps being traversed. If the dynamics is stochastic, several end states may
be reached after $\pm\tau$ time steps. The end state $R(\pm\tau)$ is then chosen according
to the transition probabilities $P_f$ or $P_p$, which were defined above. The effective
depolarization $z = a + \gamma y$ of the end state is averaged over all possible jumps.
According to eq. (8), this average should equal the activity $a(R(0))$ at the start
state $R(0)$.

Figure 1 $I_\tau$ is the information which a first spike at time $t$ transmits about the
state of the input dynamics at time $t + \tau$. A neuron responding to a randomly chosen
set of 32 states would convey very little information (stars). A neuron with suitable
learning parameters (see text) can improve its response: Circle: response to an
optimal set of 32 states at $t \approx 10^6$. Triangle: response to 33 states at $t \approx 2 \cdot 10^6$, if the
“drift” is slightly positive ($\alpha_f = 0.55$; $\alpha_p = 0.45$; $\eta_f = \eta_p = 0.8$). Bowtie: response
to 31 states at $t \approx 2.5 \cdot 10^6$, if a maximal drift ($\alpha_f^a = \alpha_p^b = 0$) is compensated by
interactions of two different dendrites.

The activity $a(R(0))$ at the start state still depends on the unknown activity
$a(R(\pm\tau))$ at the end state. The latter can again be interpreted as an average over
random jumps, which start from state $R(\pm\tau)$. By repeating this concatenation of
jumps, one forms a random walk (which should not be confused with a trajectory
of the input dynamics). By averaging eq. (8) over all start states $R(0)$ one can
easily show that the total jump probability $\alpha_f/\rho + \alpha_p/\rho = 1/(1 + \gamma\bar{y}/\bar{a}) < 1$, so
that the random walk is of finite mean length $\bar{a}/(\gamma\bar{y})$. Thus one can sum $\gamma y$ over
the end states of all jumps in a random walk and average this sum over all the
random walks starting at $S_j$. According to construction, this average should equal
the activity $a(S_j)$, once the learning process has reached quasistationarity.
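The finite mean length follows from a geometric-series argument. A short worked version, assuming that after every jump the walk continues with the same total jump probability $p = \alpha_f/\rho + \alpha_p/\rho$ and otherwise stops (the symbol $p$ is introduced here only for this illustration):

```latex
p = \frac{1}{1 + \gamma\bar{y}/\bar{a}}, \qquad
\mathbb{E}[\text{number of jumps}]
  = \sum_{n=0}^{\infty} n\, p^{n} (1 - p)
  = \frac{p}{1 - p}
  = \frac{\bar{a}}{\gamma\bar{y}} .
```

With the values $\bar{y} = 0.1$ and $\gamma = 0.1$ used in Section 4, this mean length is $100\,\bar{a}$ jumps.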

4  Numerical  Results 

The learning algorithm was tested with input generated by a special Markov process,
whose high internal symmetry permits an accurate assessment of the learning
process. Each of its $10 \cdot 2^5 = 320$ equally probable states is denoted by a
digit sequence $C\,B_1B_2B_3B_4B_5$ with $C \in \{0, 1, \ldots, 9\}$ and $B_i \in \{0, 1\}$. Each state
$C\,B_1B_2B_3B_4B_5$ forms two equally probable transitions to the states $C'\,B_2B_3B_4B_5 0$
and $C'\,B_2B_3B_4B_5 1$, with $C' = C + 1 \bmod 10$.

With suitable learning parameters ($\bar{y} = 0.1$, $\gamma = 0.1$, $\alpha_f = \alpha_p = 0.5$, and $1/\eta_f =
1/\eta_p = 1.25$, $\eta_w = 0.04$) and random initial synaptic weights the neuron needed
1 million time steps of online learning to develop a quasistationary response. It
responded to 32 states of the Markov process, namely the 8 states $0\,B_1B_2B_3 01$ (with
$B_1B_2B_3 \in \{000, 001, \ldots, 111\}$), the 8 states $1\,B_1B_2 01 B_5$, the 8 states $2\,B_1 01 B_4B_5$,
and the 8 states $3\,01 B_3B_4B_5$. According to the rules of the input dynamics, these
four groups of states are always passed in the given order. Thus the neuron always
responds for 4 consecutive time steps. By its first spike ($y(t) = 1$ after $y(t-1) = 0$)
it transmits an information $I_\tau = \log_2(320/8) \approx 5.3$ about the state $R(t + \tau)$ of
the input dynamics at the present moment $\tau = 0$ and the next three time steps
$\tau = 1$, 2 or 3 (see fig. 1). One can prove that this is one of the “most constant”
responses to the input dynamics, in the sense that no neuron with mean firing rate
$\bar{y} = 0.1$ can transmit more information $I_\tau$ by its first spike or show more than
4 consecutive responses at some times without showing fewer than 4 consecutive
responses at other times.
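The structure of this response can be reproduced with a few lines of code. The sketch below is illustrative only: it hard-codes the 32-state response set described above instead of learning it, and all function names and the random seed are assumptions made for this example.

```python
import itertools
import math
import random

def successors(state):
    """Two equally probable successors of the state (C, (B1, ..., B5))."""
    c, bits = state
    c_next = (c + 1) % 10
    return [(c_next, bits[1:] + (b,)) for b in (0, 1)]

def responds(state):
    """Response set from the text: pattern 01 at bits (B_{4-C}, B_{5-C}) for C = 0..3."""
    c, bits = state
    return c < 4 and bits[3 - c] == 0 and bits[4 - c] == 1

def response_run_lengths(steps, seed=0):
    """Lengths of completed runs of consecutive responses along one random trajectory."""
    rng = random.Random(seed)
    state = (5, (0, 0, 0, 0, 0))          # a non-responding start state
    runs, current = [], 0
    for _ in range(steps):
        state = rng.choice(successors(state))
        if responds(state):
            current += 1
        elif current:
            runs.append(current)
            current = 0
    return runs

states = [(c, bits) for c in range(10)
          for bits in itertools.product((0, 1), repeat=5)]
response_set = [s for s in states if responds(s)]
first_spike_states = [s for s in response_set if s[0] == 0]
information_bits = math.log2(len(states) / len(first_spike_states))
```

Here `response_set` contains the 32 states listed above, every completed run returned by `response_run_lengths` has length 4, and `information_bits` evaluates $\log_2(320/8)$.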

A deeper analysis of the properties of the random walk showed that the learning
process is robust against almost all parameter changes (as long as $\eta_f$, $\eta_p$, $\eta_w$, $\eta_\theta$, $\eta_\rho$,
and $\gamma$ remain small enough). The one critical learning parameter is $\alpha_f/\eta_f - \alpha_p/\eta_p$,
the mean drift of the random walk into the future direction per jump. In the extreme
case $\alpha_p = 0$ the learning goal (8) would require the neuron’s strongest activity to
precede its output $y = 1$ in time, which is inconsistent with the output being
caused by the neuron’s strongest activity. Only if the drift was rather small did
the learning process converge to quasistationarity (triangles in fig. 1).

There is a way to make the learning process robust against changes in $\alpha_f$ and $\alpha_p$. One
constructs a neuron with two different dendrites, differing in the values $\rho$, $\langle \rho \rangle_\rho$,
$\langle \sum_i \Delta w_i \rangle_\rho$, and $\langle a \rangle_\rho$ and the synaptic learning parameters $\alpha_f$, $\alpha_p$, $\eta_f$, $\eta_p$: in
dendrite $a$ the drift $\alpha_f^a/\eta_f^a - \alpha_p^a/\eta_p^a$ is chosen negative, in dendrite $b$ the drift
$\alpha_f^b/\eta_f^b - \alpha_p^b/\eta_p^b$ is chosen positive. This learning process always reached
quasistationarity in numerical simulation. The early part of neuronal responses was caused
by dendrite $b$, the late part by dendrite $a$. The worst choice of learning parameters (maximal
drift $\alpha_f^a = \alpha_p^b = 0$) still produced a rather good neuronal response (bowties in fig.
1). The known anatomical connections in the neocortex suggest that dendritic type
$a$ might correspond to apical dendrites and dendritic type $b$ to basal dendrites [1].
This speculative hypothesis might be tested by simulating networks of such neurons.

In previous papers e.g. [5] the effect on the learning properties of filtering or other preprocessing
of input data to networks was considered. A strategy for adaptive filtering based directly on this
analysis will be presented. We focus in Section 2 on linear networks and the delta rule since
this simple case permits the approach to be easily tested. Numerical experiments on some simple
problems show that the method does indeed enhance the performance of the epoch or off line
method considerably. In Section 3, we discuss briefly the extension to non-linear networks and in
particular to backpropagation. The algorithm in its simple form is, however, less successful and
current research focuses on a practicable extension to non-linear networks.

1  Introduction 

In  previous  papers  e.g.  [5]  the  effect  on  the  learning  properties  of  filtering  or  other 
preprocessing  of  input  data  to  networks  was  considered.  Such  filters  are  usually 
constructed  either  on  the  basis  of  knowledge  of  the  problem  domain  or  on  statistical 
or  other  properties  of  the  input  space.  However  in  the  paper  cited  it  was  briefly 
pointed  out  that  it  would  be  possible  to  construct  filters  designed  to  optimise  the 
learning  properties  directly  by  inferring  spectral  properties  of  the  iteration  matrix 
during  the  learning  process,  permitting  the  process  to  be  conditioned  dynamically. 
The  resulting  technique  is  naturally  parallel  and  easily  carried  out  alongside  the 
learning  rule  itself,  incurring  only  a  small  computational  overhead.  Here  we  explore 
this  idea. 

2  Linear  Theory 

Although not the original perceptron algorithm, the following method, known as the
delta rule, is generally accepted as the best way to train a simple perceptron. Since
there is no coupling between the rows we may consider the single output perceptron.
Denote the required output for an input pattern $x$ by $y$, and the weights by the
vector $w^T$. Then,

$$\delta w = \eta (y - w^T x) x$$

where $\eta$ is a parameter to be chosen called the learning rate [7], p. 322. Thus given
a current iterate weight vector $w_{\mathbf{k}}$,

$$w_{\mathbf{k}+1} = w_{\mathbf{k}} + \eta (y - w_{\mathbf{k}}^T x) x = (I - \eta x x^T) w_{\mathbf{k}} + \eta y x \tag{1}$$

since the quantity in the brackets is scalar. The bold subscript $\mathbf{k}$ here denotes
the kth iterate, not the kth element. We will consider a fixed and finite set of
input patterns $x_p$, $p = 1 \ldots t$, with corresponding outputs $y_p$. If we assume that the
patterns are presented in repeated cyclic order, the presented $x$ and corresponding
$y$ of (1) repeat every $t$ iterations, and given a sufficiently small $\eta$, the corresponding
weights go into a limit $t$-cycle: see [3] or [5]. Of course this is not the only or
necessarily the best possible presentation scheme [2], but other methods require a priori
analysis of the data or dynamic reordering. Since we are assuming that we have
a fixed and finite set of patterns $x_p$, $p = 1 \ldots t$, an alternative strategy is not to
update the weight vector until the whole epoch of $t$ patterns has been presented.
This idea is attractive since it actually generates the steepest descent direction for
the least sum of squares error over all the patterns. We will call this the epoch
method to distinguish it from the usual delta rule. (Other authors use the term off
line learning.) This leads to the iteration

$$w_{\mathbf{k}+1} = \Omega w_{\mathbf{k}} + \eta \sum_{p=1}^{t} (y_p x_p) \tag{2}$$

where $\Omega = (I - \eta XX^T) = (I - \eta L)$. Here $X$ is the $n \times t$ matrix whose columns are the
$x_p$’s. The $\mathbf{k}$ in (2) is, of course, not equivalent to that in (1), since it corresponds to a
complete epoch of patterns. There is no question of limit cycling, and indeed a fixed
point will be a true least squares minimum $w^*$. To see this, put $w_{\mathbf{k}+1} = w_{\mathbf{k}} = w^*$
and observe that (2) reduces to the normal equations for the least squares problem.
Moreover the iteration (2) is simply steepest descent for the least squares problem,
applied with a fixed step length. Clearly $L = XX^T$ is symmetric and positive
semi-definite. In fact, provided the $x_p$ span $\mathbb{R}^n$, it is (as is well known) strictly positive
definite. The eigenvalues of $\Omega$ are $1 - \eta\,$(the corresponding eigenvalues of $L$), and
hence for $\eta$ sufficiently small $\rho(\Omega) = \|\Omega\|_2 < 1$. Unfortunately, however, (2)
generally requires a smaller value of $\eta$ than (1) to retain numerical stability [3], [5].

How can we improve stability of (2) or indeed (1)? Since these are linear iterations
it is only necessary to remove the leading eigenvalue of the iteration matrix.
Specifically we seek a matrix $T$ such that if each input vector $x_p$ is replaced by
$Tx_p$, more rapid convergence will result. We see from (2) that the crucial issue is
the relationship between the unfiltered update matrix $\Omega = (I - \eta XX^T)$ and its
filtered equivalent $(I - \eta TXX^TT^T) = \Omega'$, say. In general these operations may be
defined on spaces of different dimension: see e.g. [5], but here we assume $T$ is $n \times n$.
To choose $T$ we compute the largest eigenvalue and corresponding eigenvector of
$XX^T$. This may be carried out by the power method [6], p. 147 at the same time
as the ordinary delta rule or epoch iteration: the computation can be performed by
running through patterns one at a time, just as for the learning rule itself. We get
a normalised eigenvector $p_1$ of $XX^T$ corresponding to the largest eigenvalue $\lambda_1$ of
$XX^T$. Set

$$T = I + (\lambda_1^{-1/2} - 1)\, p_1 p_1^T. \tag{3}$$

A routine calculation shows that $TXX^TT^T$ has the same eigenvectors as $XX^T$,
and the same eigenvalues but with $\lambda_1$ replaced by 1. Each pattern $x_p$ should then
be multiplied by $T$, and, since we are now iterating with different data, the current
weight estimate $w$ should be multiplied by

$$T^{-1} = I + (\lambda_1^{1/2} - 1)\, p_1 p_1^T. \tag{4}$$

We may then repeat the process to remove further eigenvalues. Basically the same
idea can be used for the iteration with the weights updated after each pattern
as in (1), but it is less convenient: for simplicity, our numerical experiments refer
only to the application to (2). Two small examples are discussed: in each case four
eigenvalues are removed, though in fact for the first example only the first eigenvalue
is significant.
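The construction of $T$ can be sketched in a few lines. This is an illustration under assumptions rather than the authors' implementation: the power method is run for a fixed number of sweeps over the patterns, only one eigenvalue is removed, and the data are the four 3-vectors $(1,0,0)^T$, $(1,1,0)^T$, $(1,1,1)^T$, $(1,0,1)^T$ of the first example. Function names are hypothetical.

```python
import numpy as np

def leading_eigenpair(X, sweeps=200):
    """Power method for X X^T, accumulated one pattern (column of X) at a time."""
    n = X.shape[0]
    p = np.ones(n) / np.sqrt(n)
    for _ in range(sweeps):
        q = np.zeros(n)
        for x in X.T:                 # run through the patterns one at a time
            q += (x @ p) * x          # adds x x^T p
        p = q / np.linalg.norm(q)
    lam = float(np.sum((X.T @ p) ** 2))   # Rayleigh quotient p^T X X^T p
    return lam, p

def filter_matrices(lam, p):
    """T of eq. (3) and its inverse of eq. (4)."""
    identity = np.eye(p.size)
    projector = np.outer(p, p)
    T = identity + (lam ** -0.5 - 1.0) * projector
    T_inv = identity + (lam ** 0.5 - 1.0) * projector
    return T, T_inv

# Columns are the patterns x_1 .. x_4 of the first example.
X = np.array([[1, 1, 1, 1],
              [0, 1, 1, 0],
              [0, 0, 1, 1]], dtype=float)
L = X @ X.T
lam1, p1 = leading_eigenpair(X)
T, T_inv = filter_matrices(lam1, p1)
filtered = T @ L @ T.T                # same eigenvectors, lambda_1 replaced by 1
```

For this data $XX^T$ has eigenvalues $(7 \pm \sqrt{33})/2$ and 1, so after filtering the spectrum is $\{1, 1, (7 - \sqrt{33})/2\}$, which is consistent with the remark that only the first eigenvalue is significant for this example.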

The first example is a ‘toy’ problem taken from [5], p. 113. There are four patterns
and $n = 3$: $x_1 = (1, 0, 0)^T$, $x_2 = (1, 1, 0)^T$, $x_3 = (1, 1, 1)^T$ and $x_4 = (1, 0, 1)^T$. The
corresponding outputs are $y_1 = 0$, $y_2 = y_3 = y_4 = 1$. Figure 1 shows the number of
epochs required to obtain agreement in the weights (at the end of the epoch for the
delta rule) to an error of 1E-8, for various values of $\eta$ and for the three methods: the
delta rule (1), the epoch or off line method (2) and the Adaptive Filtering Method
(AFT) using (3).

Figure 1    Figure 2

As a slightly more realistic example we also considered the Balloon Database B from
the UCI Repository of Machine learning and domain theories [1]. This is a set of
sixteen 4-vectors which are linearly separable. Figure 2 compares the performance of
the epoch and AFT methods: on this very well behaved database the epoch method
actually performs better in terms of iterations than the delta rule even though it
requires a larger value of $\eta$. With $\eta = 1$, AFT requires only three epochs and good
performance is obtained over a wide range, whereas the best obtainable with the
epoch method is 28 with $\eta = 0.05$ and this is a very sharp minimum: $\eta = 0.03$
requires 49 iterations, and $\eta = 0.06$ requires 45.
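The effect of the filter on the epoch iteration (2) can be explored with the toy data of the first example. The sketch below makes several assumptions that are not taken from the text: weights start at zero, convergence is declared when successive weight vectors agree to 1E-8, only one eigenvalue is filtered, the leading eigenpair is taken from `numpy.linalg.eigh` rather than the power method, and the learning rates 0.25 and 1.0 are arbitrary illustrative choices. It does not attempt to reproduce the epoch counts of Figures 1 and 2.

```python
import numpy as np

def epochs_to_converge(X, y, eta, tol=1e-8, max_epochs=100000):
    """Iterate eq. (2) from w = 0 until successive weight vectors agree to `tol`."""
    n = X.shape[0]
    omega = np.eye(n) - eta * (X @ X.T)
    drive = eta * (X @ y)             # eta * sum_p y_p x_p
    w = np.zeros(n)
    for epoch in range(1, max_epochs + 1):
        w_next = omega @ w + drive
        if np.max(np.abs(w_next - w)) < tol:
            return epoch, w_next
        w = w_next
    raise RuntimeError("no convergence; eta is probably too large")

# Toy problem: columns of X are x_1 .. x_4, with targets y_1 .. y_4.
X = np.array([[1, 1, 1, 1],
              [0, 1, 1, 0],
              [0, 0, 1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 1.0])

# One-eigenvalue filter T = I + (lambda_1^(-1/2) - 1) p_1 p_1^T, as in eq. (3).
eigenvalues, eigenvectors = np.linalg.eigh(X @ X.T)
lam1, p1 = eigenvalues[-1], eigenvectors[:, -1]
T = np.eye(3) + (lam1 ** -0.5 - 1.0) * np.outer(p1, p1)

epochs_plain, w_plain = epochs_to_converge(X, y, eta=0.25)
epochs_filtered, w_filtered = epochs_to_converge(T @ X, y, eta=1.0)

# The filtered network computes w^T T x, so T^T w maps its weights back.
w_ls = np.linalg.lstsq(X.T, y, rcond=None)[0]
print(epochs_plain, epochs_filtered, T.T @ w_filtered, w_ls)
```

Both runs converge to the same least squares weights once the filtered weights are mapped back with $T^T$; the filtered iteration tolerates the larger step because the leading eigenvalue of its update matrix has been reduced to 1.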

3  Non-linear  Networks 

The usefulness of linear neural systems is limited, since many pattern recognition
problems are not linearly separable. We will define a general nonlinear delta rule.
The backpropagation rule [7], pp. 322-328 used in many neural net applications is a
special case of this. For the linear network the dimension of the input space and
the number of weights are the same: $n$ in our previous notation. Now we will let
$M$ denote the total number of weights and $n$ the input dimension. So the input
patterns $x$ to our network are in $\mathbb{R}^n$, and we have a vector $w$ of parameters in
$\mathbb{R}^M$ describing the particular instance of our network: i.e. the vector of synaptic
weights. For a single layer perceptron with $m$ outputs, the vector $w$ is the $m \times n$
weight matrix, and thus $M = mn$. For a multilayer perceptron, $w$ is the cartesian
product of the weight matrices in each layer. For brevity we consider just a single
output. The network computes a function $g : \mathbb{R}^M \times \mathbb{R}^n \to \mathbb{R}$. In [4] or [5] it is
shown that the generalised delta rule becomes

$$\delta w = \eta \big(y - g(w, x)\big) \nabla g(w, x) \tag{5}$$

$\nabla g$ takes the place of $x$ in (1). Observe that a change of weights in any given layer
will cause a (linear) change in the input vector to the successive hidden layer. Thus
the required gradient is obtained by i) differentiating the current layer with respect
to the weights in the layer and ii) multiplying this by the matrix representation
of the Fréchet derivative with respect to the inputs in the succeeding layers. Thus,
let the kth weight layer, $k = 1, 2, \ldots, K$, say, have weight matrix $W_k$: each row
of these matrices forms part of the parameter vector $w$. On top of each weight
layer is a (possibly) non-linear layer. At each of the $m$ (say) hidden units we have
an activation function: $h_j$ for the jth unit. The function $h$ whose jth co-ordinate
function is $h_j$ is a mapping $\mathbb{R}^m \to \mathbb{R}^m$. However for a multilayer perceptron it
is rather special in that $h_j$ only depends on the jth element of its argument: in
terms of derivatives this means that the Jacobian $H$ of $h$ is diagonal. Let the $H$
and $h$ for the unit layer after the kth weight layer also be subscripted $k$. (Input
units to the bottom layer just have identity activation, as is conventional.) Finally
suppose that the input to the kth weight layer (i.e. the output from the units of
the previous layer) is denoted $v_k$, with $v_1 = x$. A small change $\delta W_k$ in the kth
weight matrix causes the input to the corresponding unit layer to change by $\delta W_k v_k$.
The Fréchet derivative of a weight layer $W_r v_r$ with respect to its input $v_r$ is of
course just $W_r$. Thus the output is changed by $H_K W_K H_{K-1} W_{K-1} \ldots H_k \delta W_k v_k$.
Since this expression is linear in $\delta W_k$ it yields for each individual element that
component of the gradient of $g$ corresponding to the weights in the kth layer. To
see this, recall that the gradient is actually just a Fréchet derivative, i.e. a linear
map approximating the change in output. In fact we might as well split up $W_k$ by
rows and consider a change $(\delta w_{i,k})^T$ in the ith row (only). This corresponds to
$\delta W_k = e_i(\delta w_{i,k})$, so $\delta W_k v_k = e_i(\delta w_{i,k}) v_k = e_i v_k^T (\delta w_{i,k})^T = V_{i,k}(\delta w_{i,k})^T$, say,
where $V_{i,k}$ is the matrix with ith row $v_k^T$, and zeros elsewhere. Thus, that section
of $\nabla g$ in (5) which corresponds to changes in the ith row of the kth weight matrix
is $H_K W_K H_{K-1} W_{K-1} \ldots H_k V_{i,k}$.

The calculation is illustrated in the following example.
Figure 3 shows the architecture. The shaded neurons represent bias neurons
with  activation  functions  the  identity,  and  the  input  to  the  bias  neuron  in  the  input 
layer  is  fixed  at  1.  The  smaller  dashed  connections  have  weights  fixed  at  1,  and  the 
larger  dashed  ones  have  weights  fixed  at  0.  (This  approach  to  bias  has  been  adopted 
to  keep  the  matrix  dimensions  consistent.)  All  other  neurons  have  the  standard  ac¬ 
tivation  function  a  =  1/(1  -b  e®).  Other  than  the  bias  input,  the  data  is  the  same 
as  the  first  example  in  Section  2.  Let  the  initial  weights  be 
between the input and hidden layers, numbering from the top in Figure 3. Thus no bias
is applied (but weights in rows 1 and 2 in the last column are trainable), and the
bottom row of weights is fixed. Weights are

0.01  0.02  0.03

between hidden and output layer, so a bias of 0.03 is applied to the output neuron
and this is trainable. With input $x_1 = (1, 0, 0, 0)^T$ and $y_1 = 1$, we find that (working
to 4 d.p.) the outputs from the non-bias hidden units are 0.5025 and 0.5050. The bias
output is of course 1. The output from the output unit is 0.5113. Using the fact that
the derivative of the standard activation function is $\sigma(1 - \sigma)$, the diagonal elements
of the 3 x 3 matrix $H_1$ are 0.2500, 0.2500 and 1 from the bias unit. The off-diagonal
elements are, of course, 0. $H_2$ is a 1 x 1 matrix (i.e. a scalar) with value 0.2499.
Ignoring terms corresponding to fixed weights we have 11 trainable weights: 8 below
the hidden layer and 3 above. $\nabla g$ is thus an 11-vector with the first three elements
(say) corresponding to the hidden-to-output weights. (This is the convenient ordering
for backpropagation, as the output gradient terms must be computed first.) So
the first three elements of $\nabla g$ are given by $H_2 V_{1,2} = 0.2499 V_{1,2} = 0.2499 v_2^T$ (since
$e_1$ is here a 1-vector with element 1) = (0.1256, 0.1262, 0.2499). Proceeding to the
sub-hidden layer weights, the first four elements are given by $H_2 W_2 H_1 V_{1,1}$. The
product $H_2 W_2 H_1$ evaluates to (0.0006, 0.0012, 0.0075). $V_{1,1} = e_1 v_1^T$, which has
(1, 0, 0, 1) as first row and zeros elsewhere. Hence elements 4 to 7 of $\nabla g$ are (0.0006,
0, 0, 0.0006). Similarly elements 8 to 11 of $\nabla g$ are (0.0012, 0, 0, 0.0012).
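
The arithmetic of this example can be checked numerically. The following is a minimal NumPy sketch rather than the author's own code. The first-layer weight matrix is not reproduced in the text, so `W1` below is an assumption, chosen only to be consistent with the quoted hidden outputs (0.5025 and 0.5050) and with the statement that no hidden bias is applied. The input vector `v1` is taken from the first row of $V_{1,1}$, and the hidden-to-output weights use the stated output bias of 0.03.

```python
import numpy as np


def sigmoid(x):
    """Standard logistic activation 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))


# ASSUMPTION: hypothetical first-layer weights, consistent with the quoted
# hidden outputs. The bottom row is the fixed row that passes the bias on.
W1 = np.array([[0.01, 0.0, 0.0, 0.0],
               [0.02, 0.0, 0.0, 0.0],
               [0.00, 0.0, 0.0, 1.0]])
W2 = np.array([[0.01, 0.02, 0.03]])   # hidden-to-output weights, bias 0.03
v1 = np.array([1.0, 0.0, 0.0, 1.0])   # input pattern followed by bias input 1

# Forward pass: the last hidden unit is the bias unit (identity activation).
a1 = W1 @ v1
v2 = np.array([sigmoid(a1[0]), sigmoid(a1[1]), a1[2]])
out = sigmoid(W2 @ v2)                # array of shape (1,)

# Diagonal Jacobians of the unit layers, using sigma' = sigma * (1 - sigma).
H1 = np.diag([v2[0] * (1 - v2[0]), v2[1] * (1 - v2[1]), 1.0])
H2 = np.diag(out * (1 - out))         # 1 x 1 matrix

# Hidden-to-output section of the gradient: H2 V_{1,2} = H2 v2^T.
grad_top = (H2 @ v2[None, :]).ravel()

# Input-to-hidden sections: row i of W1 contributes (H2 W2 H1)[0, i] * v1^T.
back = H2 @ W2 @ H1                   # 1 x 3 matrix
grad_row1 = back[0, 0] * v1
grad_row2 = back[0, 1] * v1

grad = np.concatenate([grad_top, grad_row1, grad_row2])  # the 11-vector
print(np.round(grad, 4))
```

Under these assumptions the printed vector agrees, to four decimal places, with the gradient elements quoted in the example.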
Experiments  with  the  adaptive  filtering  algorithm  have  as  yet  been  less  successful 
than  in  the  linear  case,  due  to  problems  of  underdetermination  and  non-stationarity. 
These  difficulties,  which  do  not  appear  to  be  insurmountable,  are  the  focus  of 
current  research. 
RAM-based neural networks are designed to be hardware amenable, which affects the choice
of learning algorithms. Reverse differentiation enables derivatives to be obtained efficiently on
any architecture. The performance of four learning methods on three progressively more difficult
problems is compared using a simulated RAM network. The learning algorithms are:
reward-penalty, batched gradient descent, steepest descent and conjugate gradient. The applications are:
the 838 encoder, character recognition, and particle classification. The results indicate that
reward-penalty can solve only the simplest problem. All the gradient-based methods solve the particle
task, but the simpler ones require more CPU time.

1  Introduction 

The driving force behind RAM-based neural networks is their ease of hardware
realisation. The desire to retain this property influences the design of learning
algorithms. Traditionally, this has led to the use of the reward-penalty algorithm, since
only a single scalar value needs to be communicated to every node [7]. The
mathematical tool of reverse differentiation enables derivatives of an arbitrary function
to be obtained efficiently at an operational cost of less than three times that of the
original function. Using three progressively more complex problems, the performance
of three gradient-based algorithms and reward-penalty is compared.

2  HyperNet  Architecture 

HyperNet is the term used to denote the hardware model of a RAM-based neural
architecture proposed by Gurney [5], which is similar to the pRAM of Gorse and
Taylor [4]. A neuron is termed a multi-cube unit (MCU), and consists of a number
of subunits, each with an arbitrary number of inputs. $j$ and $k$ reference nodes in the
hidden and output layers respectively, with $i = 1, \ldots, I$ indexing the subunits. $\mu$
denotes the site addresses, and is the set of bit strings $\mu_1 \ldots \mu_n$, where $n$ denotes the
number of inputs to the subunit. $z_c$ refers to the real-valued input, with $z_c \in [0, 1]$
and $\bar{z}_c = (1 - z_c)$. For each of the $2^n$ site store locations, two sets are defined:
$c \in M^{\mu}_0$ if $\mu_c = 0$; $c \in M^{\mu}_1$ if $\mu_c = 1$. The access probability $P(\mu^{ij})$ for location $\mu$
in subunit $i$ of hidden layer node $j$ is therefore
$P(\mu^{ij}) = \prod_{c \in M^{\mu}_0} \bar{z}_c \prod_{c \in M^{\mu}_1} z_c$. The
subunit response ($s$) is then gained by summing the proportional site values, which
are in turn accumulated to form the multi-cube activation ($a$). The node's output
($y$) is given by passing the MCU activation through a sigmoid transfer function
($y = \sigma(a) = 1/(1 + e^{-a})$). The complete forward pass of a two layer network of
HyperNet nodes is given in Table 1 in the form of a computational graph.
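
The forward pass of a single real-valued MCU can be sketched as follows. This is an illustrative reading of the description above, not Gurney's implementation: the dictionary layout of the site store, the function names, the unweighted sum of subunit responses and the unscaled sigmoid are all assumptions made for the sketch.

```python
from itertools import product
import math


def access_probabilities(z):
    """Return {site address mu: access probability} for one subunit.

    z is the list of real-valued inputs in [0, 1]. A bit mu_c = 1
    contributes z_c to the product and a bit mu_c = 0 contributes 1 - z_c.
    """
    probs = {}
    for mu in product((0, 1), repeat=len(z)):
        p = 1.0
        for bit, zc in zip(mu, z):
            p *= zc if bit == 1 else (1.0 - zc)
        probs[mu] = p
    return probs


def mcu_output(subunit_inputs, site_values):
    """Output y of one multi-cube unit.

    subunit_inputs: one input list per subunit.
    site_values: one dict per subunit mapping site address -> stored value
    (hypothetical storage layout).
    """
    activation = 0.0
    for z, sites in zip(subunit_inputs, site_values):
        probs = access_probabilities(z)
        # Subunit response: site values weighted by their access probability.
        activation += sum(sites[mu] * p for mu, p in probs.items())
    return 1.0 / (1.0 + math.exp(-activation))
```

For binary inputs exactly one site has access probability 1, so the sketch reduces to an ordinary RAM look-up followed by the sigmoid.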
Gradient-based algorithms require the extraction of error gradient terms. Reverse
differentiation [10] enables derivatives to be obtained efficiently on any architecture.
Functions are usually composed of simple operations, and this is the basis on which
reverse accumulation works. A computational graph with vertices 1 to $N$ connected
by arcs is constructed. The first $n$ vertices are the independent variables, with
the results of subsequent basic operations $f_u(\cdot)$ stored in intermediate variables
$x_u = f_u(\cdot)$, $u = n + 1, \ldots, N$, with $F(x) = x_N$. The gradient vector $g(x) = \partial F(x)/\partial x$
can then be obtained by defining $\bar{x}_u = \partial x_N/\partial x_u$, $u = 1, \ldots, N$, for vertex $u$, and
applying the chain rule. The process thus starts at vertex $N$, where $\bar{x}_N = 1$, and
percolates down. The reverse accumulation of a two layer HyperNet network is also
given in Table 1.

Table 1 Forward and reverse accumulation of a two layer network of real-valued
HyperNet nodes (columns: Computational Graph; Reverse Accumulation).
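
Reverse accumulation over a computational graph can be illustrated with a very small sketch. The operation set, the list-based graph encoding and the example function are all assumptions chosen for brevity; they are not the HyperNet graph of Table 1.

```python
import math


def forward(inputs, ops):
    """Evaluate the graph; vertices 0..n-1 are the independent variables."""
    x = list(inputs)
    for name, args in ops:
        if name == "add":
            x.append(x[args[0]] + x[args[1]])
        elif name == "mul":
            x.append(x[args[0]] * x[args[1]])
        elif name == "sigmoid":
            x.append(1.0 / (1.0 + math.exp(-x[args[0]])))
        else:
            raise ValueError(name)
    return x


def reverse(x, ops, n):
    """Return dF/dx for the n inputs, where F is the last vertex."""
    xbar = [0.0] * len(x)
    xbar[-1] = 1.0                      # start at vertex N with xbar_N = 1
    for u in range(len(x) - 1, n - 1, -1):
        name, args = ops[u - n]
        if name == "add":
            xbar[args[0]] += xbar[u]
            xbar[args[1]] += xbar[u]
        elif name == "mul":
            xbar[args[0]] += xbar[u] * x[args[1]]
            xbar[args[1]] += xbar[u] * x[args[0]]
        elif name == "sigmoid":
            xbar[args[0]] += xbar[u] * x[u] * (1.0 - x[u])
    return xbar[:n]


# Hypothetical example: F(x0, x1) = sigmoid(x0 * x1 + x0).
ops = [("mul", (0, 1)), ("add", (2, 0)), ("sigmoid", (3,))]
x = forward([0.5, -1.0], ops)
grad = reverse(x, ops, n=2)
```

Each vertex is visited once in each direction, which is why the cost of the gradient is a small constant multiple of the cost of evaluating the function itself.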

3  Learning  Algorithms 

The reward-penalty algorithm used is an adaptation of the P-model associative
algorithm devised by Barto [2] and modified for HyperNet nodes by Gurney [5].
A single scalar reinforcement signal, calculated from a performance measure, is
globally broadcast. The internal parameters are updated using this signal and
local information. The performance metric used is the mean-squared error on the
output layer ($e = \frac{1}{N^k} \sum_k (y^t_k - y_k)^2$, where $y^t_k$ is the target output for
node $k$, and $N^k$ is the number of output layer nodes). The binary reward signal
is then probabilistically generated: $r = 1$ with probability $(1 - e)$; $r = 0$
otherwise. Given a reward signal ($r = 1$) the node's internal parameters are
modified so that the current output is more likely to occur from the same stimulus. A
penalising step ($r = 0$) should have the opposite effect. The update is therefore
$\Delta S_\mu = \alpha [r(\hat{y} - y) + \bar{r} \lambda (1 - \hat{y} - y)]$, where $\bar{r} = 1 - r$, $\alpha$ is the learning rate, $\lambda$ is the degree of
penalising, and $\hat{y} = y^t$ for output layer nodes and is the closest extreme for hidden
layer nodes: $\hat{y} = 1$ if $y > 0.5$; $\hat{y} = 0$ if $y < 0.5$; or $\hat{y} = 1$ with probability 0.5 if
$y = 0.5$.
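
The reward-penalty step can be sketched in a few lines. The function names and the use of Python's `random` module are assumptions for illustration, and the update is written for a single site value without any access-probability weighting, which the text above does not specify.

```python
import random


def reward_signal(error, rng=random):
    """Binary reward: 1 with probability (1 - error), otherwise 0."""
    return 1 if rng.random() < (1.0 - error) else 0


def closest_extreme(y, rng=random):
    """Desired output for a hidden node: the nearest of 0 and 1."""
    if y > 0.5:
        return 1.0
    if y < 0.5:
        return 0.0
    return 1.0 if rng.random() < 0.5 else 0.0   # tie broken at random


def site_update(y, y_hat, r, alpha, lam):
    """Change in a site value: alpha * [r (y_hat - y) + (1 - r) lam (1 - y_hat - y)]."""
    return alpha * (r * (y_hat - y) + (1 - r) * lam * (1.0 - y_hat - y))
```

With `r = 1` the update moves the output towards `y_hat`; with `r = 0` it moves the output towards the opposite extreme, scaled by the degree of penalising.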

Gradient descent is the simplest gradient technique, with the update $\Delta S = -\alpha \bar{S}$,
where $\alpha$ is the step size and $\bar{S}$ is the gradient term. In the trials reported here,
batched updating was used, where the gradients are accumulated over the training
set (an epoch) before being applied. Gradient descent has also been applied to
RAM-based nodes by Gurney [5], and Gorse and Taylor [4].

Steepest descent represents one of the simplest learning rate adaptation techniques. A
line search is employed to determine the minimum error along the search direction.
The line search used was proposed by Armijo [1], and selects the largest power-of-two
learning rate that reduces the network error.
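
The step-size selection just described can be sketched as a scan over powers of two. This follows only the one-sentence description above; the exponent range, the function name and the plain-list parameter vector are assumptions, and Armijo's original rule includes a sufficient-decrease condition that is not modelled here.

```python
def largest_power_of_two_step(error, params, direction, max_exp=5, min_exp=-20):
    """Return the largest 2**k (min_exp <= k <= max_exp) that reduces the error.

    error: callable mapping a parameter list to a scalar error.
    direction: search direction, e.g. the negative gradient.
    Returns 0.0 if no step in the scanned range reduces the error.
    """
    base = error(params)
    for k in range(max_exp, min_exp - 1, -1):
        alpha = 2.0 ** k
        trial = [p + alpha * d for p, d in zip(params, direction)]
        if error(trial) < base:
            return alpha
    return 0.0
```

For example, with the error `sum(p * p for p in params)`, parameters `[1.0]` and direction `[-2.0]`, any step below 1 reduces the error, so the scan settles on 0.5.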

Successive steps in steepest descent are inherently perpendicular [6], thus leading
to a zig-zag path to the minimum. A better search direction can be obtained by
incorporating some of the previous search direction. Momentum is a crude
example. Conjugate gradient utilises the previous direction with the update rule
$\Delta S = \alpha \, \delta S = \alpha (-\bar{S} + \beta \, \delta S^{-})$, where $\delta S^{-}$ refers to the change on the previous
iteration. $\beta$ is calculated to ensure that successive steps are conjugate, and was
calculated using the Polak-Ribiere rule [9].
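
A minimal sketch of the direction update with the Polak-Ribiere coefficient follows. The formula for the coefficient is the standard one; the function names, the list-based vectors and the fallback to plain steepest descent on the first iteration are assumptions of the sketch.

```python
def polak_ribiere_beta(grad, prev_grad):
    """Standard Polak-Ribiere coefficient g.(g - g_prev) / (g_prev.g_prev)."""
    num = sum(g * (g - pg) for g, pg in zip(grad, prev_grad))
    den = sum(pg * pg for pg in prev_grad)
    return num / den if den > 0.0 else 0.0


def conjugate_direction(grad, prev_grad=None, prev_dir=None):
    """New search direction: -grad + beta * previous direction."""
    if prev_grad is None or prev_dir is None:
        return [-g for g in grad]           # first iteration: steepest descent
    beta = polak_ribiere_beta(grad, prev_grad)
    return [-g + beta * d for g, d in zip(grad, prev_dir)]
```

The step size along the returned direction would then be chosen by a line search, as for steepest descent.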

4  Benchmark  Applications 

The 838 encoder is an auto-associative problem which has been widely used to
demonstrate the ability of learning algorithms [6]. Both the input and output
layers contain $n = 8$ nodes, with $\log_2 n = 3$ hidden layer nodes. Presentation of $n$
distinct patterns requires a unique encoding for each to be formulated at the
hidden layer. Of the $n^n$ possible encodings, $n!$ are unique. Thus for the 838 encoder,
only 40,320 unique encodings exist in the 16,777,216 possible, or 0.24%. The
network is fully connected, with every MCU containing only one subunit. The training
vectors consist of a single set bit, which is progressively shifted right.
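
The encoding counts quoted above can be reproduced directly. The sketch assumes that each of the $n$ patterns is assigned one of the $2^3 = n$ binary hidden codes, which gives $n^n$ assignments in total; the variable names are illustrative.

```python
from math import factorial, log2

n = 8                                  # input and output layer size
hidden = int(log2(n))                  # hidden layer size
possible = n ** n                      # every pattern picks any of the n codes
unique = factorial(n)                  # assignments with all codes distinct
fraction = 100.0 * unique / possible   # percentage of usable assignments
```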

The character set database is a subset of that utilised by Williams [11], and
consists of twenty-four examples of each letter of the alphabet, excluding ‘I’ and ‘O’.
Each is sixteen by twenty-four binary pixels in size, and was generated from UK
postcodes. Nineteen examples of each letter form the training set. A randomly
generated configuration was utilised, with each pixel mapped only once. Eight inputs
per subunit, and seven subunits per MCU resulted in seven neurons in the
hidden layer. The output layer contains twelve nodes, each with a single subunit fully
connected to the hidden layer.

The particle scattering images were generated by a pollution monitoring
instrument previously described by Kaye [8]. Data was collected on eight particle types,
namely: long and short caffeine fibres; 12 μm and 3 μm silicon dioxide fibres; copper
flakes; 3 μm and 4.3 μm polystyrene spheres; and salt crystals. The training set
comprised fifty randomly selected images of each type, quantised to $16^2$ 5-bit images
to reduce computational load. A fully connected network was used, with sixteen
hidden, and six output layer MCUs. Every MCU consisted of two subunits, each
with eight randomly generated connections. A more complete description of the
airborne particle application can be found in [3].

5  Experimental  Results 

The site store locations were randomly initialised for all but reward-penalty, where
they were simply set to zero. The convergence criterion was 100% classification for the
838 encoder, and 95% for the other applications. Parameter settings were gleaned
experimentally for the reward-penalty and gradient descent algorithms. The
gradient descent setting of $\rho$ was also used for steepest descent and conjugate gradient.
The results are averaged over ten networks for the simpler problems, and five for
the particle task. Table 2 summarises the parameter settings, and results for the
four algorithms on the three applications. “Maximum”, “mean”, “deviation”, and
“coefficient” are in epochs, with the latter two being the standard deviation and
coefficient of variation ($s/\bar{x} \cdot 100$) respectively. * denotes unconverged networks,
and hence the maximum cycle limit. “CPU / cycle” is based on actual CPU time
(secs) required on a Sun SPARCstation 10 model 40. “Total time” is given by
“mean” x “CPU / cycle”.
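
The two derived quantities can be recomputed from the other entries of Table 2. The helper names below are illustrative only; the numbers are taken from the table (reward-penalty on the 838 encoder, and conjugate gradient on the particle task).

```python
def coefficient_of_variation(deviation, mean):
    """Standard deviation as a percentage of the mean."""
    return 100.0 * deviation / mean


def total_time(mean_cycles, cpu_per_cycle):
    """Mean number of cycles multiplied by the CPU time per cycle (secs)."""
    return mean_cycles * cpu_per_cycle


cv = coefficient_of_variation(181, 239)   # reward-penalty, 838 encoder
t = total_time(156, 190.55)               # conjugate gradient, particle task
```

Small differences from the tabulated totals are to be expected where the CPU time per cycle has been rounded for presentation.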

For the 838 encoder, conjugate gradient was the fastest, requiring marginally less
time than reward-penalty and almost six and a half times fewer cycles. The
character recognition task highlights the difference in learning ability. Reward-penalty
was unable to converge, being more than three orders of magnitude away from the
desired error. Batched gradient descent again demonstrated consistent learning, but
was still the slowest of the gradient-based algorithms. The results for the particle
task exemplify the problem of steepest descent: every network became trapped and
required a relatively large number of cycles to be freed. The power of conjugate
gradient also becomes clear, needing five times fewer cycles than gradient or steepest
descent. Reward-penalty was not tried due to its failure on the simpler character
problem.

6  Conclusions 

The performance of various learning algorithms applied to a RAM-based artificial
neural network has been investigated. Traditionally, reward-penalty has been
applied to these nodes due to its inherent hardware amenability. The experiments
reported here suggest that reinforcement learning is best suited to simple
problems. With respect to the gradient-based algorithms, gradient descent was
consistently the slowest algorithm. While steepest descent was the fastest on the character
recognition task, it had a tendency to become trapped. Conjugate gradient was by
far the best algorithm, being fastest on two of the three applications.
Task             Parameters           Reward-    Gradient   Steepest   Conjugate
                                      Penalty    Descent    Descent    Gradient

838 Encoder      ρ                    0.35       0.21       0.21       0.21
                 λ                    0.7        NA         NA         NA
                 α                    32         4          NA         NA
                 Maximum              449        578        107        75
                 Mean (x̄)             239        499        55         37
                 Deviation (s)        181        74         32         23
                 Coefficient (V)      76         15         58         62
                 CPU / cycle (secs)   0.004      0.006      0.020      0.024
                 Total time (secs)    0.980      2.944      1.089      0.881

Alphabetic       ρ                    1.75       0.21       0.21       0.21
Character        λ                    0.8        NA         NA         NA
Recognition      α                    4          2          NA         NA
                 Maximum                         259        63
                 Mean (x̄)             -          244        50         41
                 Deviation (s)        -          19         8          14
                 Coefficient (V)      -          8          16         34
                 CPU / cycle (secs)              6.47       13.85      32.45
                 Total time (secs)    -          1579       692        1331

Airborne         ρ                    -          0.9        0.9        0.9
Particle         α                    -          1          NA         NA
Classification   Maximum              -          1014       1190       197
                 Mean (x̄)             -          828        890        156
                 Deviation (s)        -          111        250        49
                 Coefficient (V)      -          13         28         31
                 CPU / cycle (secs)   -          56.44      133.93     190.55
                 Total time (secs)    -          46733      119199     29726

Table 2 Summary of the performance of the training algorithms applied to the
benchmark tests.
This paper describes new work on partial match using Correlation Matrix Memory (CMM), a
type of binary associative neural network. It has been proposed that CMM can be used as an
inference engine for expert systems, and we suggest that a partial match ability is essential to
enable a system to deal with real world problems. Now, an emergent property of CMM is an
ability to perform partial match, which may make CMM a better choice of inference engine than
other methods that do not have partial match. Given this, the partial match characteristics of
CMM have been investigated both analytically and experimentally, and these characteristics are
shown to be very desirable. CMM partial match performance is also compared with a standard
database indexing method that supports partial match (Multilevel Superimposed Coding), which
shows CMM to compare well under certain circumstances, even with this heavily optimised method.
Parallels are drawn with cognitive psychology and human memory.

Keywords: Correlation Matrix Memory, Neural Network, Partial Match, Human Memory.

1  Introduction 

Correlation  Matrix  Memory  (CMM)  [4]  has  been  suggested  as  an  inference  engine, 
possibly for use in an expert system [1]. Touretzky and Hinton [7] were the first to
implement successfully a reasoning system in a connectionist architecture, and there
have been several neural network based systems suggested since. However, only the
systems [1, 7] can be said to use a truly distributed knowledge representation (the
issue of localist versus distributed knowledge representation will not be discussed
here); for instance, SHRUTI [6] is an elaborate system using temporal synchrony.
The  SHRUTI  model  was  developed  with  a  localist  representation  and,  although 
a  way  of  extending  the  model  to  incorporate  a  semi-distributed  representation  is 
given,  this  is  not  a  natural  extension  of  the  model;  moreover,  it  is  difficult  to  see 
how  learning  could  occur  with  either  form  of  knowledge  representation.  Many  of 
the  properties  that  have  become  synonymous  with  neural  networks  actually  rely 
on  the  use  of  a  distributed  knowledge  representation.  These  are:  (a)  a  graceful 
degradation  of  performance  with  input  data  corruption;  (b)  the  ability  to  interpolate 
between  input  data  and  give  a  sensible  output,  and  (c)  a  robustness  to  system 
damage.  Partial  match  ability  is  very  much  connected  with  (a)  and  (b),  which 
give  neural  network  systems  their  characteristic  flexibility.  Here  we  suggest  that  a 
certain  flexibility  in  reasoning  is  invaluable  in  an  inference  engine  because,  in  real 
world  problems,  the  input  data  is  unlikely  to  be  a  perfect  match  for  much  of  the 
time.  The  CMM-based  inference  engine  suggested  by  Austin  [1]  uses  a  distributed 
knowledge  representation,  and  therefore  this  system  can,  in  principle,  offer  the 
desired  flexibility.  The  focus  of  this  paper  is  a  full  analysis  of  the  partial  match 
ability  of  CMM,  including  a  comparison  with  a  conventional  database  indexing 
method  that  offers  partial  match.  This  conventional  database  indexing  method  is 
Multilevel  Superimposed  Coding  [3].  Section  2  contains  a  more  detailed  description 
of  CMM  and  partial  match,  including  an  analysis  of  the  theoretical  partial  match 
performance  of  CMM.  Section  3  compares  CMM  partial  match  with  Multilevel 
Superimposed Coding, and Section 4 considers how these results may be relevant
to  cognitive  psychology  and  the  study  of  human  memory. 

2  CMM  and  Partial  Match 

CMM is a type of binary associative memory, which can be thought of as a matrix of
binary weights. The fundamental process which occurs in CMM is the association
by Hebbian learning of binary vectors representing items of information, which
subsequently allows an item of information to be retrieved given the appropriate
input. CMM allows very fast retrieval and, in particular, retrieval in a time which
is independent of the total amount of information stored in the memory for a given
CMM size. In [1], a symbolic rule such as $X \Rightarrow Y$ can be encoded by associating a
binary vector code for X, the rule antecedent, with a binary vector code for Y, the
rule consequent. The rule can be recalled subsequently by applying the code for
X to the memory and retrieving the code for Y. Multiple antecedent items can be
represented by superimposing (bitwise OR) the binary vector codes for each item
prior to learning. A fixed weight, sparse encoding is used to generate codes with
a coding rate that optimizes storage with respect to the number of error bits set in
the output, which are bits set in error due to interactions between the binary
representations stored in the CMM [11]. CMM learning is given by:

$M^{i+1} = M^{i} \oplus I^{T} O$  (1)

where $M^i$ is the $m \times m$ CMM at iteration $i$, $I$ and $O$ are $m$-bit binary vector codes
to be associated, and $\oplus$ means OR each corresponding element. Hence to obtain
the CMM at iteration $i + 1$, $I$ and $O$ are multiplied and the result ORed with $M^i$. Note
that learning is accomplished in a single iteration. CMM retrieval is given by:

$\Gamma = I M$  (2)

$O = \text{L-max}(\Gamma, l)$  (3)

where $I$ would be one of a pair of vectors previously associated in $M$, and the
function L-max($\Gamma$, $l$) [2] selects the $l$ highest integers in the integer vector, $\Gamma$, to be
the $l$ set bits in the output, $O$. Storage with respect to error is maximised when [11]:

$n = \log_2 m$  (4)

where $n$ is the number of set bits in each $m$-bit binary vector; usually, we choose
$l = n$. In a symbolic rule matching context, partial match would allow a rule that
has multiple items in the antecedent to fire when only a subset of the items is
present. In terms of the binary vector code representation being considered here,
and if the input to the CMM contains $z$ set bits, a full match input is characterised
by the greatest possible number of set bits ($z = n$), while a partial match has fewer
set bits than this ($n/k \le z < n$, where $k$ is the number of antecedent items). Because
of the way in which multiple input data are superimposed, CMM partial match
is insensitive to the order in which items are presented as input. For example,
consider the following symbolic rule:

engine_stalled ∧ ignition_ok ∧ fuel_gauge_low ⇒ refuel

Normally, to check if a subset {engine_stalled, ignition_ok} is a partial match, it
would be necessary at least to look at all the possible orderings of the subset, but
with CMM this is not necessary. We term this emergent property Combinatorial
Partial Match. CMM could equally be used in frame based reasoning, where
partial match would allow an entire frame to be recalled given a subset of the
frame.
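
CMM learning, recall and order-insensitive partial match can be illustrated with a toy NumPy sketch. The tiny memory size, the hand-picked codes and the tie-breaking rule inside `l_max` are assumptions made for the illustration; a real system would use sparse codes generated by hashing.

```python
import numpy as np


def train(M, I, O):
    """One-shot learning: OR the outer product of I and O into M."""
    return M | np.outer(I, O)


def l_max(gamma, l):
    """Set the l positions of gamma with the highest sums (ties: lowest index)."""
    out = np.zeros_like(gamma)
    out[np.argsort(-gamma, kind="stable")[:l]] = 1
    return out


def recall(M, I, l):
    """Retrieve the output code associated with input I."""
    return l_max(I @ M, l)


def code(m, bits):
    """Hypothetical helper: m-bit binary vector with the given bits set."""
    v = np.zeros(m, dtype=int)
    v[list(bits)] = 1
    return v


m, l = 8, 3
engine_stalled = code(m, [0])
ignition_ok = code(m, [1])
fuel_gauge_low = code(m, [2])
refuel = code(m, [5, 6, 7])

# Superimpose the antecedent items and store the rule in a single step.
antecedent = engine_stalled | ignition_ok | fuel_gauge_low
M = train(np.zeros((m, m), dtype=int), antecedent, refuel)

full = recall(M, antecedent, l)
# Partial match: two of the three items, presented in a different order.
partial = recall(M, ignition_ok | engine_stalled, l)
```

Because superposition is a bitwise OR, the order of the items cannot affect the input vector, so no enumeration of orderings is needed.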

Partial  Match  Performance 

If each input learned by the CMM, $I$, is obtained by hashing and superimposing
the attributes that identify each record, then the CMM can be used subsequently
to locate the record in memory, using $I$ (or a partial match for $I$), by associating
$I$ with an output code, $O$, that represents the location of the record. Hence CMM
can be used as a database indexing method. However, if multiple matches occur
and multiple locations are represented in the CMM output, then output codes may
be so overlapped that it is impossible to identify individual codes. Therefore, a
method is needed to decode the CMM output while giving the minimum number
of false positives (records that appear to be a match in the CMM but are not
true matches) and no false negatives (records that appear not to be a match in
the CMM but are true matches). We suggest here a novel method of identifying
which output codes may be present in the CMM output, whereby output codes are
subdivided and stored in buckets in memory, according to where the middle set bit
(MB) appears in each code (Fig. 1). This approach allows us simply to identify all
set bits in the output that could be MBs; then, if all the buckets corresponding to
these bits are searched, we are guaranteed no false negative matches.


Figure 1 How output codes are divided up into buckets in secondary storage
according to the position of the middle set bit (MB); in general, $m - (n - 1)$ buckets
are needed.


Figure  2  Multilevel  Superimposed 
Coding  (from  [3]). 


The partial match performance of CMM is considered in terms of the fraction, $\gamma$,
of the index memory (which may be held in a slow secondary storage device) that
must be searched in order to answer a given partial match query; the smaller this
fraction is, the better. Other measures of performance are: computation reduction
(the proportion of the total number of bits in the index that must be searched,
not considered here), and the physical size of the index (in the comparison, we allow
both methods the same index size). We show that this fraction, $\gamma$, depends only
on: $m$, the dimension of the $m \times m$ CMM; $z$, the number of set bits in $I$, and $t$, the
expected number of matches that will be found. If we first consider the absolute
amount of memory that must be searched, this amount is due to two components:
$\omega_1$, the amount of memory that must be searched in order to obtain the CMM
output, and $\omega_2$, the amount of memory representing output codes that must be
searched in order to decode the CMM output. For $\omega_1$, only $z$ rows of the $m \times m$
CMM are needed to obtain the CMM output, hence:

$\omega_1 = zm$ bits  (5)

To derive an expression for $\omega_2$, we first look at the expected number of matches
with items stored in the CMM that will be found, $t$, which gives us the expected
number of set bits in the CMM output due only to true matches:

$\kappa_1 = \left(1 - \left(\frac{m - n}{m}\right)^{t}\right) m$ bits  (6)

But we must also take account of erroneously set bits, which occur due to
interactions between the binary representations stored in the CMM (see [11] for an
eloquent explanation of why this occurs):

$\kappa_2 = (m - n) q^{z}$ bits  (7)

where $q$ is the proportion of all $m^2$ bits in the CMM that are set, which at maximum
storage is 50%. $\kappa_1 + \kappa_2$ is now the total expected number of set bits in $O$; however,
for our purposes, we can ignore $n - 1$ of these bits because these $n - 1$ bits represent
bits that cannot possibly be MBs¹, giving:

$\kappa_3 = \kappa_1 + \kappa_2 - (n - 1)$ bits  (8)

$\kappa_3$ is the number of set bits in $O$ that could be MBs, or equivalently the number of
output code buckets that must be searched. Now, each output code bucket contains
on average $p/(m - (n - 1))$ output codes, where $p$ is the number of records in the CMM,
and each output code can be encoded in $n^2$ bits, hence:

$\omega_2 = \frac{p n^2}{m - (n - 1)} \kappa_3$ bits  (9)

Note that, for a given $p$, if $m$ is chosen such that $q = 50\%$, then $p$ is given
approximately by an expression in $m$ from [5] (i.e. $p$ may be expressed in terms of $m$).
We can now write:

$\gamma = \frac{\omega_1 + \omega_2}{m^2 + p n^2}$  (10)

This result has been verified experimentally (up to $m = 1000$; a lack of
space unfortunately does not permit the discussion of these experimental results
here).
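
The search fraction can be evaluated numerically once the component terms are fixed. The sketch below assumes one particular reading of the derivation: $\omega_1 = zm$, $\kappa_1 = (1 - ((m-n)/m)^t)m$, $\kappa_2 = (m-n)q^z$, $\kappa_3 = \kappa_1 + \kappa_2 - (n-1)$, an average bucket of $p/(m-(n-1))$ codes of $n^2$ bits each, and a total index of $m^2 + pn^2$ bits. The function name and the example parameter values are hypothetical.

```python
def search_fraction(m, n, z, t, p, q=0.5):
    """Fraction of the index searched for one partial match query.

    m: CMM dimension, n: set bits per code, z: set bits in the query,
    t: expected number of matches, p: stored records, q: CMM bit density.
    """
    omega1 = z * m                                  # CMM rows read
    kappa1 = (1.0 - ((m - n) / m) ** t) * m         # bits from true matches
    kappa2 = (m - n) * q ** z                       # erroneously set bits
    kappa3 = kappa1 + kappa2 - (n - 1)              # candidate middle bits
    omega2 = p * n ** 2 / (m - (n - 1)) * kappa3    # bucket bits read
    return (omega1 + omega2) / (m ** 2 + p * n ** 2)


# Hypothetical example values, for illustration only.
well_specified = search_fraction(m=1024, n=10, z=10, t=1, p=5000)
poorly_specified = search_fraction(m=1024, n=10, z=5, t=1, p=5000)
many_matches = search_fraction(m=1024, n=10, z=10, t=5, p=5000)
```

In this reading, a less well specified query (smaller `z`) raises the number of erroneously set bits and so enlarges the fraction searched, as does a larger expected number of matches.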

3  A  Comparison  with  Multilevel  Superimposed  Coding 

Multilevel  Superimposed  Coding  (MSC)  [3]  has  been  chosen  as  the  conventional 
database indexing method that supports partial match because of the similarities
of this method with our method; MSC is also used by several companies (e.g.
Daylight CIS Inc.) for indexing their large databases. In MSC (Fig. 2), records are
identified  by  hashing  and  superimposing  attribute  values  into  a  binary  code  using 
multiple  hash  functions  to  gain  access  at  multiple  levels  to  a  binary  tree.  Note  that 
the  resultant  codes  are  of  a  different  length,  which  is  why  multiple  hash  functions 
are  needed.  As  before,  the  idea  is  to  search  as  little  of  the  index  as  possible,  and  the 
ideal  state  of  affairs  when  locating  a  single  record  would  be  if  at  each  node  only  one 
branch  were  taken.  However,  if  the  index  is  poorly  conditioned  then  all  branches 
may  need  to  be  searched,  which  would  then  be  equivalent  to  performing  a  linear 
search  on  the  entire  index. 


¹ Consider O comprising just one output code: O has n set bits, only one of which can be a
middle set bit (MB); therefore n - 1 of the set bits are not MBs (this generalises to the
case of O comprising multiple output codes).

Figure 3 A comparison in terms of partial match performance between CMM
(left) and MSC (right, (2); from [3]), for varying expected number of matches.
Overall index size = 1.5 GB (both methods).


Kim and Lee [3] consider an example case of an index for a database of $2^{24}$ records,
for which MSC requires a 24 level tree with hash functions. The analysis for
MSC relies on specifying $t$, the expected number of matches, and both methods
are allowed the same amount of storage (1.5 GB). The results are shown in Fig. 3,
and concern the fraction of each index that must be searched versus the expected
number of matches. Remembering that a full match input is characterised by the
greatest possible number of set bits ($z = n$), while a partial match has fewer set
bits than this ($n/k \le z < n$), we observe from Fig. 3 that CMM approaches the
performance achieved by MSC provided (1) that the input is well specified, and
(2) that few true matches exist in the database. Even with a less well specified
input and with several matches, CMM still performs reasonably well (especially
considering the relative simplicity of the CMM method in comparison with the
heavily optimised and much less straightforward MSC method).

4  Implications  for  Cognitive  Psychology 

In presenting this work to colleagues from differing backgrounds, some in psychology,
it has become clear that this work may have relevance to cognitive psychology
and the study of human memory. The Encoding Specificity Principle due to
Endel Tulving [10] states: “Only that can be retrieved that has been stored, and how
it can be retrieved depends on how it was stored.” And in [8] Tulving states: “On
the  unassailable  assumption  that  cue  A  has  more  information  in  common  with  the 
trace  of  the  A-T  [T  for  target  word]  pair  of  studied  items  than  with  the  trace  of 
the  B-T  pair,  we  can  say  that  the  probability  of  successful  retrieval  of  the  target 
item  is  a  monotonically  increasing  function  of  the  information  overlap  between  the 
information present at retrieval and the information stored in memory.” The Encoding
Specificity Principle is by no means the only theory to explain this aspect
of  human  memory  function,  and  a  number  of  experiments  have  been  performed  by 
protagonists  of  this  or  other  theories.  The  work  presented  in  this  paper  could  be 
taken  in  support  of  the  Encoding  Specificity  Principle,  but  this  is  not  the  intention 
of  the  authors.  However,  CMM  could  provide  a  model  of  the  low  level  workings  of 
human  memory  and,  as  such,  the  results  of  those  experiments  performed  in  relation 
to  the  Encoding  Specificity  Principle  are  most  interesting.  For  instance,  in  [9],  674 
subjects learned a list of 24 target words, (1) with one of two sets of cue words, or
(2) without cue words; the cue words were chosen to each have an association of 1%
with the corresponding target word. In retrieval, subjects were asked to remember
the list (a) with cues, (b) without cues, or (c) with wrong cues (a cue word taken
from the alternative list). The experimental results are given as the mean number of
words remembered, listed here in descending order: (1a) 14.93; (2b) 10.62; (1b) 8.72;
(2a) 8.52; (1c) 7.45. What is interesting for CMM as a model of memory is the difference
between the results of experiments (1a) and (1b). If one postulates that the target
word  and  the  cue  word  are  somehow  stored  together,  like  the  input  to  a  CMM,  and 
that  the  output  would  be  the  equivalent  of  a  pointer  to  the  target  word  then  the 
results  of  this  paper  can  be  interpreted  in  the  light  of  the  results  in  [9].  To  do  so,  it 
is  necessary  to  envisage  some  indexing  mechanism  in  the  human  brain  for  retrieving 
information  from  memory  that  works  best  when  there  is  little  index  memory  to  be 
searched, which is perfectly plausible. Then (1a) would correspond to a full match
input, which would be expected to give best retrieval performance, as was found to
be the case in the experiments; similarly, (1b) would correspond to a partial match
input,  which  would  be  expected  to  give  a  rather  worse  retrieval  performance,  again 
as  was  found  to  be  the  case. 

5  Conclusions 

We  have  analysed  the  partial  match  performance  of  CMM,  in  line  with  the  proposed 
use  of  CMM  as  an  inference  engine  for  an  expert  system  that  must  solve  real  world 
problems.  The  analysis  has  enabled  a  comparison  with  a  conventional  database 
indexing  technique  that  supports  partial  match,  the  results  of  which  suggest  CMM 
is  good  at  performing  partial  match  when  the  input  is  well  specified  and  few  matches 
exist.  Interestingly,  a  similar  behaviour  is  observed  in  human  memory  experiments 
and,  although  we  would  not  go  so  far  as  to  suggest  that  human  memory  is  so 
simple  as  CMM,  we  observe  that  there  are  similarities  that  would  support  the  use 
of  CMM,  or  CMM  partial  match,  as  a  model  of  some  of  the  low  level  workings  of 
human  memory  in  further  experiments  in  cognitive  psychology. 
Learning and generalization in a two-layer Radial Basis Function network (RBF) is examined
within a stochastic training paradigm. Employing a Bayesian approach, expressions for
generalization error are derived under the assumption that the generating mechanism (teacher)
for the training data is also an RBF, but one for which the basis function centres and widths
need not correspond to those of the student network. The effects of regularization, via a weight
decay term, are examined. The cases in which the student has greater representational power
than the teacher (over-realizable), and in which the teacher has greater power than the student
(unrealizable) are studied. Finally, simulations are performed which validate the analytic results.

1  Introduction 

When  considering  supervised  learning  in  neural  networks,  a  quantity  of  particular 
interest is the generalization error, a measure of the average deviation between desired
and actual network output across the space of possible inputs. Generalization
error  consists  of  two  components:  approximation  error  and  estimation  error.  Given 
a  particular  architecture,  approximation  error  is  the  error  made  by  the  optimal 
student  of  that  architecture,  and  is  caused  by  the  architecture  having  insufficient 
representational  power  to  exactly  emulate  the  teacher;  it  is  an  asymptotic  quantity 
as  it  cannot  be  overcome  even  in  the  limit  of  an  infinite  amount  of  training  data. 
If  the  approximation  error  is  zero,  the  problem  is  termed  realizable;  if  not,  it  is 
termed  unrealizable.  Estimation  error  is  the  error  due  to  not  having  selected  an 
optimal  student  of  the  chosen  architecture;  it  is  a  dynamic  quantity  as  it  changes 
during  training  and  is  caused  by  having  insufficient  data,  noisy  data  or  a  learning 
algorithm  which  is  not  guaranteed  to  reach  an  optimal  solution  in  the  limit  of  an 
infinite  amount  of  data.  There  is  a  trade-off  between  representational  power  and 
the amount of data required to achieve a particular error value (the sample complexity)
in that the more powerful the student, the greater the ability to eliminate
the  approximation  error  but  the  larger  the  amount  of  data  required  to  find  a  good 
student. 

This  paper  employs  a  Bayesian  scheme  in  which  a  probability  distribution  is  derived 
for  the  weights  of  the  student;  similar  approaches  can  be  found  in  [6,  1,  2]  and  [3]. 
In [7], a bound is derived for generalization error in RBFs under the assumption
that the training algorithm finds a global minimum in the error surface; regularization
is not considered. For the exactly realizable case, Freeman et al. calculate
generalization error for RBFs using a similar framework to that employed here [2].

2  The  RBF  Network  and  Generalization  Error 

The RBF architecture consists of a two-layer fully-connected network, and is a universal
approximator for continuous functions given a sufficient number of hidden
units [4]. The hidden units will be considered to be Gaussian basis functions, parameterised
by a vector representing the position of the basis function centre in input
space and a scalar representing the width of the basis function. These parameters
are assumed to be fixed by a suitable process, such as a clustering algorithm. The
output layer computes a linear combination of the activations of the basis functions,
parameterised by the adaptive weights $w$ between hidden and output layers.

Training examples consist of input-output pairs $(\xi, \zeta)$. The components of $\xi$ are
uncorrelated Gaussian random variables of mean 0, variance $\sigma_\xi^2$, while $\zeta$ is generated
by applying $\xi$ to a teacher RBF and corrupting the output with zero-mean additive
Gaussian noise, variance $\sigma^2$. The number, position and widths of the teacher
hidden units need not correspond to those of the student, allowing investigation of
over-realizable and unrealizable cases. The mapping implemented by the teacher is
denoted $f_T$, and that of the student $f_S$. Note it is impossible to examine generalization
error without some a priori belief in the teacher mechanism [9]. The training
algorithm for the adaptive weights is considered stochastic in nature; the selected
noise process leads to the following form for the likelihood:

$\mathcal{P}(D \mid w, \beta) \propto \exp(-\beta E_D)$    (1)

where $E_D$ is the training error. This form resembles a Gibbs distribution; it also
corresponds  to  the  constraint  that  minimizing  the  training  error  is  equivalent  to 
maximizing  the  likelihood  [5] .  This  distribution  can  be  realised  by  employing  the 
Langevin  algorithm,  which  is  simply  gradient  descent  with  an  appropriate  noise 
term  added  to  the  weights  at  each  update  [8].  To  prevent  over-dependence  of  the 
distribution  of  student  weight  vectors  on  the  noise,  it  is  necessary  to  introduce  a 
regularizing factor, which can be viewed as a Bayesian prior:

$\mathcal{P}(w \mid \gamma) \propto \exp(-\gamma E_W)$    (2)

where  Ew  is  a  penalty  term  based  here  on  the  magnitude  of  the  student  weight 
vector.  Employing  Bayes’  theorem,  one  can  derive  an  expression  for  the  probability 
of  a  student  weight  vector  given  the  training  data  and  training  parameters: 

$\mathcal{P}(w \mid D, \gamma, \beta) \propto \exp(-\beta E_D - \gamma E_W)$    (3)
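The Langevin updating mentioned above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the step size `eta` and the toy one-weight energy are assumptions, and `E` stands for the total energy $E_D + (\gamma/\beta) E_W$, so that sampling from $\exp(-\beta E)$ corresponds to the posterior of equation (3).

```python
import math
import random


def langevin_step(w, grad_e, eta, beta, rng):
    """One Langevin update: a gradient step on E plus Gaussian noise.

    With noise variance 2 * eta / beta per component, the long-run
    distribution of w approaches exp(-beta * E(w)) for small eta.
    """
    noise_scale = math.sqrt(2.0 * eta / beta)
    g = grad_e(w)
    return [wi - eta * gi + noise_scale * rng.gauss(0.0, 1.0)
            for wi, gi in zip(w, g)]


def sample_variance(beta, eta=0.01, steps=200_000, burn_in=5_000, seed=0):
    """Sample the toy energy E(w) = w**2 and return the variance of w."""
    rng = random.Random(seed)

    def grad_e(w):
        return [2.0 * w[0]]          # gradient of E(w) = w**2

    w = [0.0]
    total = total_sq = 0.0
    count = 0
    for step in range(steps):
        w = langevin_step(w, grad_e, eta, beta, rng)
        if step >= burn_in:
            total += w[0]
            total_sq += w[0] ** 2
            count += 1
    mean = total / count
    return total_sq / count - mean ** 2


if __name__ == "__main__":
    beta = 2.0
    # For exp(-beta * w**2) the exact variance is 1 / (2 * beta).
    print(sample_variance(beta), 1.0 / (2.0 * beta))
```

For the toy energy the sampled variance should lie close to $1/(2\beta)$; the finite step size introduces a small bias, which is why the comparison is only approximate.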

The  common  definition  of  generalization  error  is  the  average  squared  difference 
between  the  target  function  and  the  estimating  function: 

$E(w) = \langle (f_S(\xi, w) - f_T(\xi))^2 \rangle$    (4)

where $\langle \cdots \rangle$ denotes an average w.r.t. the input distribution. In practice, one only
has access to the test error, the mean of $(\zeta_p - f_S(\xi_p, w))^2$ over a set of test examples.
This quantity is an approximation to the expected risk, defined as the expectation of
$(\zeta - f_S(\xi, w))^2$ w.r.t. the joint distribution $\mathcal{P}(\xi, \zeta)$. With an additive noise model,
the expected risk decomposes to $E + \sigma^2$, where $\sigma^2$ is the variance of the noise.
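The decomposition of the expected risk can be checked in a few lines. The sketch below assumes, as the additive noise model implies, that the noisy target is $\zeta = f_T(\xi) + \epsilon$ with $\epsilon$ independent of the input, $\langle \epsilon \rangle = 0$ and $\langle \epsilon^2 \rangle = \sigma^2$; the symbol $\epsilon$ is introduced here for illustration only.

```latex
\begin{aligned}
\big\langle (\zeta - f_S(\xi, w))^2 \big\rangle
 &= \big\langle (f_T(\xi) - f_S(\xi, w) + \epsilon)^2 \big\rangle \\
 &= \big\langle (f_S(\xi, w) - f_T(\xi))^2 \big\rangle
    + 2 \langle \epsilon \rangle \big\langle f_T(\xi) - f_S(\xi, w) \big\rangle
    + \langle \epsilon^2 \rangle \\
 &= E(w) + \sigma^2 .
\end{aligned}
```

The cross term vanishes because the noise has zero mean and is independent of the input, so the test error overestimates the generalization error by the noise variance on average.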

When  employing  stochastic  training,  two  possibilities  for  average  generalization 
error arise. If one weight vector is selected from the ensemble, as is usually the case
in practice, equation (4) becomes:

$E_G = \big\langle \int dw \, \mathcal{P}(w \mid D, \gamma, \beta) \, (f_S(\xi, w) - f_T(\xi))^2 \big\rangle$    (5)

Alternatively, one can take a Bayes-optimal approach which, for squared error,
requires taking the mean estimate of the network. It is computationally impractical
to find this, but it is interesting as it represents the result of the best guess, in an
average sense. In this case generalization error takes the form:

$E_B = \big\langle \big( \int dw \, \mathcal{P}(w \mid D, \gamma, \beta) \, f_S(\xi, w) - f_T(\xi) \big)^2 \big\rangle$    (6)

In order to examine generic architecture performance independently of the particular
data employed, an average over datasets is performed, taking into account both
the position of the data in input space and the noise.

Defining $E_D$ as the sum of squared errors over the training set and defining the
regularization term $E_W = \|w\|^2$, $E_G$ and $E_B$ can be found from equations (3), (5)
and (6). The calculation is straightforward until the average over the dataset; the
complications and their resolution are discussed in [3]. Schematically:

$E_G$ = Student Output Variance + Noise Error + Student-Teacher Mismatch    (7)

$E_B$ = Noise Error + Student-Teacher Mismatch    (8)

The  student  output  variance  and  noise  error  are  estimation  error  only;  the  student- 
teacher  mismatch  is  both  estimation  and  approximation  error. 
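Because the student output is linear in the adaptive weights, the posterior of equation (3) is Gaussian and its mean, the Bayes-optimal student, satisfies $(\Phi^T \Phi + (\gamma/\beta) I) w = \Phi^T \zeta$, where $\Phi$ holds the basis function activations. The sketch below computes this mean in plain Python. The Gaussian basis function form $\exp(-|\xi - c|^2 / 2\sigma_B^2)$, the teacher weights and all function names are assumptions made for illustration; the centres and width are borrowed from the figure captions.

```python
import math
import random


def rbf_features(x, centres, width):
    """Gaussian basis activations exp(-|x - c|^2 / (2 * width^2)) for one input x."""
    return [math.exp(-sum((xi - ci) ** 2 for xi, ci in zip(x, c)) / (2.0 * width ** 2))
            for c in centres]


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def posterior_mean_weights(inputs, targets, centres, width, gamma, beta):
    """Mean of P(w | D) ~ exp(-beta * E_D - gamma * E_W) for a linear output layer.

    E_D is the sum of squared errors and E_W = |w|^2, so the mean solves
    (Phi^T Phi + (gamma / beta) I) w = Phi^T targets.
    """
    phi = [rbf_features(x, centres, width) for x in inputs]
    n = len(centres)
    a = [[sum(row[i] * row[j] for row in phi) + (gamma / beta if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    b = [sum(row[i] * t for row, t in zip(phi, targets)) for i in range(n)]
    return solve(a, b)


if __name__ == "__main__":
    rng = random.Random(1)
    centres = [(1.0, 0.0), (-0.5, math.sqrt(3) / 2), (-0.5, -math.sqrt(3) / 2)]
    width = math.sqrt(2) / 2
    teacher = [1.0, -2.0, 0.5]                     # hypothetical teacher weights
    xs = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)]
    ys = [sum(w * f for w, f in zip(teacher, rbf_features(x, centres, width)))
          for x in xs]
    print(posterior_mean_weights(xs, ys, centres, width, gamma=1e-9, beta=1.0))
```

With noiseless data from a matching teacher and a negligible ratio $\gamma/\beta$, the recovered weights should agree with the teacher weights. Note that the mean depends on $\gamma$ and $\beta$ only through their ratio.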


Figure 1 The effects of regularization: the solid curve represents optimal
regularization ($\gamma = 2.7$, $\beta = 1.6$), the dot-dash curve illustrates the
over-regularised case ($\gamma = 2.7$, $\beta = 0.16$), and the dashed curve shows the
highly under-regularised case ($\gamma = 2.7$, $\beta = 16$). The student and teacher
were matched, each consisting of 3 centres at $(1, 0)$, $(-0.5, \sqrt{3}/2)$ and
$(-0.5, -\sqrt{3}/2)$. Noise of variance 1 was employed.


Figure 2 Simulation results. The curves are for a realizable case with three
centres at $(1, 0)$, $(-0.5, \sqrt{3}/2)$ and $(-0.5, -\sqrt{3}/2)$, with centre width
$\sqrt{2}/2$ and noise of variance 2f0. The empirical curves were generated by
exhaustive training at each value of $P$, and represent averages over 100 trials.
The error bars are at 1 standard deviation of the empirical distribution.


3  Analysis  of  Generalization  Error 
3.1  The  Effects  of  Regularization 

While the effects of regularization are similar for $E_G$ and $E_B$, the optimal parameter
settings, found by minimizing equations (7) and (8) w.r.t. $\gamma$ and $\beta$, are quite
different. As discussed in [2], for $E_G$ it is necessary to optimise $\gamma$ and $\beta$ jointly, while
for $E_B$, only the ratio of $\gamma$ to $\beta$ need be considered; this optimal ratio is independent
of $P$. This dissimilarity is due to the variance term in $E_G$, which is minimised
by taking $\beta \to \infty$. These findings hold for both realizable and unrealizable cases.
To demonstrate the effects of regularization in a realizable scenario, consider figure
1 where $E_B$ is plotted versus $P$ for three cases. The solid curve illustrates optimal
regularization and shows the lowest value of generalization error that can be
achieved on average; the dot-dash curve represents the over-regularised case, in
which the prior dominates the likelihood, showing how reduction in generalization
error is substantially slowed. The dashed curve is for the highly under-regularised
case, which in the $\gamma/\beta \to 0$ case gives a divergence in both $E_G$ and $E_B$. Although
under-regularization is initially more damaging than over-regularization, its effects
are recovered from more rapidly. As training proceeds, the likelihood comes to dominate
the prior, making the incorrect setting of the prior irrelevant; this proceeds
more rapidly when the prior is weak with respect to the likelihood.

Figure 3 The over-realizable case: the dashed curve shows the over-realizable
case with training optimised as if the student matches the teacher ($\gamma = 3.59$,
$\beta = 2.56$), the solid curve illustrates the over-realizable case with training
optimised with respect to the true teacher ($\gamma = 3.59$, $\beta = 1.44$), while the
dot-dash curve is for the student matching the teacher ($\gamma = 6.52$, $\beta = 4.39$).
All the curves were generated with one teacher centre at $(1, 0)$; the over-realizable
curves had two student centres at $(1, 0)$ and $(-1, 0)$. Noise with variance 1 was
employed.

Figure 4 The unrealizable case: the solid curve denotes the case where the
student is optimised as if the teacher is identical to it ($\gamma = 2.22$, $\beta = 1.55$);
the dashed curve demonstrates the student optimised with knowledge of the true
teacher ($\gamma = 2.22$, $\beta = 3.05$), while, for comparison, the dot-dash curve shows a
student which matches the teacher ($\gamma = 2.22$, $\beta = 1.05$). The curves were generated
with two teacher centres at $(1, 0)$ and $(-1, 0)$; the unrealizable curves employed
a single student at $(1, 0)$; noise of variance 1 was utilised.

3.2 The Over-Realizable and Unrealizable Scenarios

Operationally,  selecting  a  form  for  the  student  implies  that  one  is  prepared  to 
believe  that  the  teacher  has  an  identical  form.  Therefore  optimization  of  training 
parameters must be performed on this basis. When the student is overly powerful
this leads to under-regularization, as the magnitude of the teacher weight vector is
believed to be larger than the true case. This is shown in figure 3; the dashed curve
represents  generalization  error  for  the  under-regularised  case  where  the  training 
parameters  have  been  optimised  as  if  the  teacher  matches  the  student,  while  the 
solid  curve  represents  the  same  student,  but  with  training  optimised  with  respect  to 
the true teacher. Employing an overly-powerful student rather than the optimally-matched
student can drastically slow the reduction of generalization error. Even
with training optimised with respect to the true teacher, the matching student
greatly out-performs the overly-powerful version due to the necessity to suppress the
redundant  parameters.  The  effect  is  shown  in  figure  3;  generalization  error  for  the 
matching  student  is  given  by  the  dot-dash  curve,  while  that  of  the  overly-powerful 
but  correctly  optimised  student  is  given  by  the  solid  curve.  In  the  unrealizable 
scenario,  where  the  teacher  is  more  powerful  than  the  student,  optimization  of 
training  parameters  under  the  belief  that  the  teacher  has  the  same  form  as  the 
student leads to over-regularization, as the assumed magnitude of the teacher weight
vector is smaller than the true magnitude. This effect is shown in figure 4, in which
the  solid  curve  denotes  generalization  error  for  the  over-regularised  case  based  on 
the  belief  that  the  teacher  matches  the  student,  while  the  dashed  curve  shows  the 
error  for  an  identical  student  when  the  parameters  of  the  true  teacher  are  known; 
this  knowledge  permits  optimal  regularization.  The  most  significant  effect  of  the 
teacher  being  more  powerful  than  the  student  is  the  fact  that  the  approximation 
error  is  no  longer  zero,  as  the  teacher  can  never  be  exactly  emulated  by  the  student. 
This  is  also  illustrated  in  figure  4,  where  the  dot-dash  curve  represents  the  learning 
curve  when  the  student  matches  the  teacher  (with  zero  asymptote),  while  the  two 
upper  curves  show  an  under-powerful  student,  and  have  non-zero  asymptotes. 

3.3 Simulations

To validate the analytic results, simulations were performed for
optimally-regularised and under-regularised realizable cases. The simulations involved
exhaustive training of RBF networks using Langevin updating. The empirical
results were generated by averaging over 100 runs, approximating generalization
error using a large, noiseless test set. The results are shown in figure 2; an excellent
fit between analytic and simulated results is found for $P > 50$. In the region
where $P$ is small in the under-regularised case, the analytic mean is larger than the
simulation result. In the small $P$ region, this case is particularly vulnerable to the
approximation employed in the dataset average (see [3]). The error bars are also large
in this region, as the distribution of student weights is relatively unconstrained.

4  Conclusion 

Under-regularization initially causes very poor generalization, which can be overcome
rapidly with the addition of more training data. Over-regularization is initially
less  damaging,  but  requires  a  large  quantity  of  training  data  in  order  to  overcome 
the  effect.  In  the  over-realizable  case,  there  is  a  tendency  to  under-regularise  due  to 
over-estimating  the  complexity  of  the  teacher.  There  is  also  an  increase  in  sample 
complexity.  In  the  unrealizable  case,  under-estimating  the  complexity  of  the  teacher 
leads  to  over-regularization. 
A general approach is developed to learn the conditional probability density for a noisy time series.
A universal architecture is proposed, which avoids difficulties with the singular low-noise limit. A
suitable error function is presented enabling the probability density to be learnt. The method is
compared with other recently developed approaches, and its effectiveness demonstrated on a time
series generated from a non-trivial stochastic dynamical system.

1  Introduction 

Neural networks are used extensively to learn time series, but most of the approaches,
especially those associated with the mean square error function whose
minimisation implies approximating the predictor $\mathbf{x}(t) \to x(t+1)$, will only give
information on a mean estimate of such a prediction. It is becoming increasingly
important to learn more about a time series, especially when it involves a considerable
amount of noise, as in the case of financial time series. This note is on more
recent attempts to learn the conditional probability distribution of the time series,
that is, the quantity

$P(x(t+1) \mid \mathbf{x}(t))$    (1)

where $\mathbf{x}(t)$ denotes the time-delayed vector with components
$(x(t), x(t-1), \ldots, x(t-m))$, for some suitable integer $m$.

An  important  question  is  as  to  the  nature  of  a  neural  network  architecture  that  can 
learn  such  a  distribution  when  both  very  noisy  and  nearly  deterministic  time  series 
have to be considered. A variety of approaches have recently been proposed independently
(Weigend and Nix, 1994; Srivastava and Weigend, 1994; Neuneier et al.,
1994),  and  this  plethora  of  methods  may  lead  to  a  degree  of  uncertainty  as  to 
their  relative  effectiveness.  We  here  want  to  extend  the  discussion  of  an  earlier 
paper  (Allen  and  Taylor,  1994)  and  will  derive  the  minimal  structure  of  a  universal 
approximator  for  learning  conditional  probability  distributions.  We  will  then  point 
out  the  relation  of  this  structure  to  the  concepts  mentioned  above,  and  will  finally 
apply  the  method  to  a  time  series  generated  from  a  stochastic  dynamical  system. 

2  The  General  ANN  Approach 

2.1  Minimal  Required  Network  Structure 

Let  us  formulate  the  problem  in  terms  of  the  cumulative  probability  distribution 

$F(y \mid \mathbf{x}(t)) := p(x(t+1) < y \mid \mathbf{x}(t)) = \int_{-\infty}^{y} p(y' \mid \mathbf{x}(t)) \, dy'$    (2)

first, as this function does not become singular in the noise-free limit of a deterministic
process, but reduces to

$F(y \mid \mathbf{x}(t)) = \theta(y - f(\mathbf{x}(t)))$    (3)

(where $\theta(.)$ is the threshold or Heaviside function, $\theta(x) = 1$ if $x > 0$, 0 otherwise).
We want to derive the structure of a network that, firstly, is a universal approximator
for $F(.)$ and, secondly, obtains the noise-free limit of (3). It is clear that
a difficulty would arise if the universal approximation theorem of neural networks
were applied directly to the function $F(y, \mathbf{x}(t))$ so as to expand it in terms of the
output of a single hidden layer net,


$F(y \mid \mathbf{x}(t)) = \sum_i a_i S\Big( \sum_j w_{ij} x_j + b_i y - \vartheta_i \Big)$    (4)

where $x_j := x(t - j + 1)$, $S(x) = 1/(1 + e^{-x})$, and $a_i$, $w_{ij}$, $b_i$, $\vartheta_i$ are constants with
obvious interpretation in neural network terms ($a_i$ output weights, $\vartheta_i$ thresholds,
$w_{ij}$, $b_i$ weights between input and hidden layer). Trying to obtain the deterministic
case (3), the best one can get from (4) in the limit when $a_i \to \delta_{i,i_0}$ (Kronecker
delta) and the weights of the hidden nodes become very large (so that each sigmoid
function $S(.)$ entering in (4) becomes a threshold function $\theta(.)$), is the expression

$F(y \mid \mathbf{x}(t)) = \theta\Big( y + \sum_j c_j x_j - s \Big)$    (5)

with constants $c_j$, $s$. Thus at best it is only possible to approximate the deterministic
limit of a linear process, with $f(\mathbf{x})$ a linear function of its variable.

To be able to handle the split between the single predicted variable $y$ and the input
variable $\mathbf{x}$, we proceed by developing a universal one-layer representation for $y$ first,
and then do the same for the remaining variable $\mathbf{x}$ successively. Thus the first
step results in

$F(y \mid \mathbf{x}(t)) = \sum_i a_i S(\beta_i (y - \mu_i))$,    (6)

where $\beta_i$ is the steepness of the sigmoid function, and $\mu_i$ a given threshold. We
then expand each $\mu_i$ by means of the universal approximation theorem in terms of
a further hidden layer network, resulting in expression (7) for $\mu_i(\mathbf{x})$, whose
hidden-layer parameters are denoted $c_{ijk}$ and $d_{ij}$.

In order that the right hand side of (6) be a cumulative probability distribution, i.e.
$F(-\infty \mid \mathbf{x}) = 0$, $F(\infty \mid \mathbf{x}) = 1$ and $F(y \mid \mathbf{x})$ monotone increasing in the variable $y$, the
following conditions are imposed on the coefficients entering eq. (6):

$\beta_i > 0, \quad a_i > 0, \quad \sum_i a_i = 1$    (8)

This can be realized by taking $\beta_i$ and $a_i$ as functions of new variables, e.g. $\beta_i$ as
the square of a free variable, and $a_i = \exp(q_i) / \sum_k \exp(q_k)$ in terms of free variables $q_i$.

It is now possible to see that the deterministic limit (3) arises smoothly from the
universal two-hidden-layer architecture of equations (6) and (7) when $a_i \to \delta_{ij}$
and $\beta_j \to \infty$, i.e., for one output weight one and all the others zero, and a large
steepness parameter for the unit connected to the non-vanishing weight.
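The constrained representation of equation (6) can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the centres $\mu_i$ are passed in as plain numbers rather than computed by a second hidden layer, and the constraints are enforced with a softmax for $a_i$ and a square for $\beta_i$.

```python
import math


def sigmoid(z):
    """Logistic function S(z) = 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def constrained_parameters(q, b):
    """Map free parameters to a_i > 0 with sum 1 (softmax) and beta_i > 0 (square).

    The entries of b are assumed to be non-zero so that every beta_i is positive.
    """
    top = max(q)
    expq = [math.exp(qi - top) for qi in q]      # shifted for numerical stability
    total = sum(expq)
    a = [e / total for e in expq]
    beta = [bi ** 2 for bi in b]
    return a, beta


def cdf(y, a, beta, mu):
    """F(y | x) = sum_i a_i S(beta_i (y - mu_i)); mu holds the values mu_i(x)."""
    return sum(ai * sigmoid(bi * (y - mi)) for ai, bi, mi in zip(a, beta, mu))


if __name__ == "__main__":
    a, beta = constrained_parameters(q=[0.0, 1.0, -0.5], b=[2.0, 5.0, 1.0])
    mu = [0.2, 0.6, 0.9]          # hypothetical kernel centres for one input x
    print([round(cdf(y, a, beta, mu), 4) for y in (-50.0, 0.5, 50.0)])
```

Calling `cdf` with a single unit weight and a very large steepness, for example `cdf(y, [1.0], [1e6], [0.6])`, approximates the step function of the deterministic limit (3).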

2.2  Moments 

One of the attractive properties of the use of sigmoid response functions is that it
is possible to give (after some algebra) a closed form for the moment generating
function (Allen and Taylor, 1994),

$M(t) = \int_{-\infty}^{\infty} \exp(ty) \, p(y \mid \mathbf{x}) \, dy = \sum_i a_i \frac{(\pi t / \beta_i) \exp(\mu_i t)}{\sin(\pi t / \beta_i)}$    (9)

from which all the moments arising from the conditional probability density (1) can
be calculated, e.g.

$E(Y) = \lim_{t \to 0} \frac{dM(t)}{dt} = \sum_i a_i \mu_i$    (10)

$E(Y^2) = \lim_{t \to 0} \frac{d^2 M(t)}{dt^2} = \sum_i a_i \Big[ \mu_i^2 + \frac{1}{3} \Big( \frac{\pi}{\beta_i} \Big)^2 \Big]$    (11)

For $a_i \to \delta_{ij}$, $\beta_j \to \infty$ the deterministic limit ensues, with $E(Y^2) = [E(Y)]^2$.
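The first two moments follow directly from the closed forms, since each kernel is a logistic density with mean $\mu_i$ and variance $\pi^2 / (3 \beta_i^2)$. The short sketch below evaluates them; the function names are hypothetical and the inputs are assumed to satisfy the constraints on $a_i$ and $\beta_i$.

```python
import math


def mixture_moments(a, beta, mu):
    """Return E(Y) and E(Y^2) for the density whose CDF is sum_i a_i S(beta_i (y - mu_i))."""
    mean = sum(ai * mi for ai, mi in zip(a, mu))
    second = sum(ai * (mi ** 2 + (math.pi / bi) ** 2 / 3.0)
                 for ai, bi, mi in zip(a, beta, mu))
    return mean, second


def conditional_std(a, beta, mu):
    """Standard deviation of the predicted conditional density."""
    mean, second = mixture_moments(a, beta, mu)
    return math.sqrt(max(second - mean ** 2, 0.0))   # clip tiny negative rounding errors


if __name__ == "__main__":
    # One kernel with beta = pi / sqrt(3) has unit variance around its centre.
    print(mixture_moments([1.0], [math.pi / math.sqrt(3.0)], [2.0]))
    # A very steep single kernel reproduces the deterministic limit (zero spread).
    print(conditional_std([1.0], [1e9], [0.4]))
```

The predicted standard deviation gives a simple error bar on the mean prediction, which a network trained on mean square error alone does not provide.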


2.3  The  Error  Function 

It is now necessary to suggest a suitable error function in order to make the universal
expression given by (6) and (7) the true conditional cumulative distribution function
given by the available data. As the true value for $x(t+1)$ is not known, one cannot
use standard mean square error. Let us assume that the process is stationary. Then
the negative expectation value of the logarithm of the likelihood,

$E = -\langle \log p(x(t+1) \mid \mathbf{x}(t)) \rangle \approx -\frac{1}{N} \sum_t \log p(x(t+1) \mid \mathbf{x}(t))$,    (12)

(a sample average over the $N$ available time steps) is the appropriate choice, based on
Kullback's Lemma (see, e.g., Papoulis, 1984), according to which (12) in terms of the true
(unknown) probability density, $p_{\mathrm{true}}$, is always smaller than (12) in terms of the
probability density predicted by the network, $p$. Hence by minimising (12) one can hope to
always “get closer” to $p_{\mathrm{true}}$.


2.4  Regaining  the  Conditional  Probability  Density 

In order to get back to the conditional probability density, we have to take the
derivative of the output of (6) with respect to the target, $y$, yielding

$p(y \mid \mathbf{x}) = \frac{\partial F(y \mid \mathbf{x})}{\partial y} = \sum_i a_i \beta_i S(\beta_i (y - \mu_i(\mathbf{x}))) \big( 1 - S(\beta_i (y - \mu_i(\mathbf{x}))) \big)$    (13)

Each term in (13) is Gaussian-shaped, so our resulting network structure can be
summarized as follows: The output node, which predicts the conditional probability
density $p(x(t+1) = y \mid \mathbf{x}(t))$, is connected to a layer of RBF-like nodes. The output
of the $i$th “RBF”-unit is determined by its input, $x(t+1)$ (the same for all
nodes), and its centre, $\mu_i$, the latter being an $\mathbf{x}$-dependent function implemented in
a separate one-hidden-layer network with the usual sigmoidal hidden nodes. The
parameters $a_i$ and $\beta_i$ have obvious interpretations as a priori probability and inverse
kernel width, respectively. (Note that from (10) and (11) one gets $\sigma_i = \pi / (\sqrt{3} \beta_i)$ for
the standard deviation of the $i$th kernel, i.e. its “width”.) The same structure can
basically be found in the other approaches mentioned above too, with the following
differences and simplifications.
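The density of equation (13) and the error function of equation (12) can be sketched together as follows. This is a minimal illustration under assumptions: `centre_fn` is a hypothetical stand-in for the separate sigmoidal networks that compute the centres $\mu_i(\mathbf{x})$, the delay vector is taken as $(x(t), \ldots, x(t-m+1))$, and the small floor inside the logarithm is a numerical guard that is not part of the original formulation.

```python
import math


def kernel_density(y, a, beta, mu):
    """p(y | x) = sum_i a_i beta_i S(z_i) (1 - S(z_i)) with z_i = beta_i (y - mu_i)."""
    total = 0.0
    for ai, bi, mi in zip(a, beta, mu):
        e = math.exp(-abs(bi * (y - mi)))     # S(z)(1 - S(z)) is symmetric in z
        total += ai * bi * e / (1.0 + e) ** 2
    return total


def negative_log_likelihood(series, m, a, beta, centre_fn):
    """Sample estimate of the error function E for a scalar time series.

    centre_fn maps the delay vector [x(t), ..., x(t - m + 1)] to the list of
    kernel centres mu_i(x).
    """
    total = 0.0
    count = 0
    for t in range(m - 1, len(series) - 1):
        x_vec = [series[t - k] for k in range(m)]
        p = kernel_density(series[t + 1], a, beta, centre_fn(x_vec))
        total -= math.log(max(p, 1e-300))     # guard against log(0)
        count += 1
    return total / count


if __name__ == "__main__":
    # A single kernel centred on the previous value, as a toy predictor.
    series = [0.1, 0.4, 0.3, 0.5, 0.45]
    print(negative_log_likelihood(series, 1, [1.0], [4.0], lambda x: [x[0]]))
```

In training, the gradient of this quantity with respect to the free parameters behind $a_i$, $\beta_i$ and the centre networks would be followed downhill; that step is omitted here.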


3  Comparison  with  Other  Approaches 

The CDEN (Conditional Density Estimation Network) of Neuneier et al. (1994)
uses only one hidden layer for computing all of the $\mu_i(\mathbf{x})$, which corresponds to
simplifying (7) by making $c_{ijk}$ and $d_{ij}$ independent of the subscript $i$. This simplification
is certainly justified when the $\mu_i(\mathbf{x})$ are of a similar functional form, with
similar a priori probabilities, but may cause difficulties when this assumption is
strictly violated. On the other hand, the CDEN includes a further generalisation
of the architecture proposed here by making the kernel widths, $\sigma_i$, and the output
weights, $a_i$, $\mathbf{x}$-dependent, computing them as outputs of separate one-hidden-layer
networks. This is not necessary in order for the architecture to be a universal approximator,
but may lead to a considerable reduction of the “RBF”-layer size, the
latter, though, at the expense of additional complexity in other parts of the network.
It thus depends on the specific problem under consideration whether this further
generalisation is an advantage.

Network            “RBF” kernel function   $\mu_i$ x-dependent           $\sigma_i$ x-dependent    $a_i$ x-dependent
As proposed here   $S'(.)$                 yes, separate hidden layers   no                        no
CDEN               Gaussian                yes, 1 shared hidden layer    yes                       yes
Soft histograms    triangular              no, from preprocessing        no, from preprocessing    yes
Weigend, Nix       1 Gaussian              yes                           yes                       unity

Table 1 Comparison between the different network architectures mentioned in
the text.

The soft histogram approach proposed by Srivastava and Weigend (1994) is identical
to a mixture of triangular-shaped “RBF” nodes, $P(y \mid x) = \sum_i a_i(x) R_i(y)$, with

$$R_i(y) = h_i \frac{y - \mu_{i-1}}{\mu_i - \mu_{i-1}} \quad \text{if } \mu_{i-1} < y \le \mu_i, \qquad R_i(y) = h_i \frac{\mu_{i+1} - y}{\mu_{i+1} - \mu_i} \quad \text{if } \mu_i < y < \mu_{i+1},$$

and $R_i(y) = 0$ otherwise. The “heights” $h_i$ are related to the kernel widths
$\sigma_i = \mu_{i+1} - \mu_{i-1}$ by $h_i = 2/\sigma_i$ in order to ensure that the
normalisation condition $\int R_i(y)\,dy = 1$ is satisfied. The kernel centres $\mu_i$ result
from a preprocessing “binning” procedure (described in [Srivastava, Weigend 1994]).
The output of the network is thus similar to the CDEN and the model proposed
in this paper, with the difference that now only the output weights $a_i(x)$ are
x-dependent, whereas the kernel centres, $\mu_i$, are held constant.
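The normalisation argument for the triangular kernels can be checked numerically. The following is a minimal sketch, not the procedure of Srivastava and Weigend: the bin centres, the mixture weights and the helper names `triangular_kernel`, `integrate` and `density` are all illustrative assumptions.

```python
def triangular_kernel(y, mu_prev, mu, mu_next):
    """Triangular kernel R_i(y) centred at mu with support (mu_prev, mu_next)."""
    sigma = mu_next - mu_prev          # kernel width sigma_i
    h = 2.0 / sigma                    # height chosen so that the area equals 1
    if mu_prev < y <= mu:
        return h * (y - mu_prev) / (mu - mu_prev)
    if mu < y < mu_next:
        return h * (mu_next - y) / (mu_next - mu)
    return 0.0


def integrate(func, lo, hi, steps=20000):
    """Midpoint-rule quadrature, adequate for a piecewise-linear integrand."""
    dx = (hi - lo) / steps
    return sum(func(lo + (k + 0.5) * dx) for k in range(steps)) * dx


centres = [0.0, 0.2, 0.5, 0.6, 1.0]    # hypothetical centres from a binning step
weights = [0.1, 0.6, 0.3]              # hypothetical a_i(x) for one fixed input x


def density(y):
    """Mixture P(y|x) = sum_i a_i(x) R_i(y) over the three interior kernels."""
    return sum(a * triangular_kernel(y, centres[i], centres[i + 1], centres[i + 2])
               for i, a in enumerate(weights))


area_single = integrate(lambda y: triangular_kernel(y, 0.0, 0.2, 0.5), 0.0, 1.0)
area_mixture = integrate(density, 0.0, 1.0)
```

Because each kernel integrates to one, the mixture is a proper density whenever the output weights are non-negative and sum to one, which is why only the weights need to carry the x-dependence in this scheme.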

Finally, the architecture introduced by Weigend and Nix (1994) reduces the size
of the “RBF” layer to only one node, assuming a Gaussian probability distribution,
which is completely specified by $\mu$ and $\sigma$, both of which are given as the
(x-dependent) output of a separate network branch. Obviously, this parametric
approximation reduces the network complexity, but leads to a biased prediction.

4  Simulation 

We tested the performance of our method on an artificial time series generated from
a stochastic coupling of two stochastic dynamical systems,

$$x(t+1) = \Theta(\xi - \vartheta)\, a\, x(t)[1 - x(t)] + (1 - \Theta(\xi - \vartheta))\,[1 - x^{\kappa}(t)], \qquad (14)$$

where $\Theta(.)$ symbolizes the Heaviside function, as before, and the parameters $a$, $\kappa$
and $\xi$ are random variables drawn from a uniform distribution, $\xi \in [0, 1]$, $a \in [3, 4]$,
$\kappa \in [0.5, 1.25]$. The prior probabilities of the two processes are determined by the
Figure 1 Centres of the “RBF” kernels, after training (black lines). The grey dots
show a state-space plot of the time series.

Figure 2 Cross section of the true (narrow graph) and predicted (bold graph)
conditional probability density, $P(x(t+1) = y \mid x(t) = 0.6)$.


threshold constant $\vartheta$, which was chosen such that we got a ratio of 2:1 in favour of the
left process (i.e., $\vartheta = 1/3$). We applied a network with ten “RBF” nodes to learn
the conditional probability density of this time series. Figure 1 shows a state-space
plot of the time series (dots) and the centre positions of the “RBF” kernels, $\mu_i(x(t))$
(black lines). Figure 2 shows a cross-section of the conditional probability density for
$x(t) = 0.6$, $P(x(t+1) = y \mid x(t) = 0.6)$, and compares the correct function (narrow
grey line) with the one predicted by the network (bold black line). Apparently the
network has captured the relevant features of the stochastic process and correctly
predicts the existence of two clusters. Note that a conventional network for time
series prediction, which only predicts the conditional mean of the process, would be
completely inappropriate in this case as it would predict a value between the two
clusters, which actually never occurs. The same holds for the network proposed by
Weigend and Nix, which would also predict a value for $x(t+1)$ between the two
clusters, together with a much too large error bar resulting from the assumption of a
Gaussian distribution.
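The point about the conditional mean can be illustrated by sampling the process (14) directly. The sketch below is an illustration only, not the simulation code used for the figures; the seed, the sample sizes and the names `step`, `samples` and `in_gap` are assumptions. For $x(t) = 0.6$ the logistic branch yields values of at least $3 \cdot 0.6 \cdot 0.4 = 0.72$, while the second branch yields values below about 0.472, so the interval between them is never visited.

```python
import random


def step(x, rng, theta=1.0 / 3.0):
    """One step of the stochastic map (14); returns (next value, used_logistic)."""
    xi = rng.uniform(0.0, 1.0)
    a = rng.uniform(3.0, 4.0)
    kappa = rng.uniform(0.5, 1.25)
    if xi > theta:                        # Heaviside(xi - theta) = 1: logistic branch
        return a * x * (1.0 - x), True
    return 1.0 - x ** kappa, False        # otherwise: second branch


rng = random.Random(0)                    # fixed seed, an arbitrary choice

# Generate a time series and record which branch produced each point.
x, series, flags = 0.3, [], []
for _ in range(20000):
    x, used_logistic = step(x, rng)
    series.append(x)
    flags.append(used_logistic)
fraction_logistic = sum(flags) / len(flags)

# Sample the conditional distribution of x(t+1) given x(t) = 0.6.
samples = [step(0.6, rng)[0] for _ in range(5000)]
mean_at_06 = sum(samples) / len(samples)
in_gap = [s for s in samples if 0.48 < s < 0.71]
```

Under these assumptions the sample mean falls inside the empty interval between the two clusters, which is exactly the value a conditional-mean predictor would report.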

5  Conclusions 

A  general  approach  to  the  learning  of  conditional  probability  densities  for  station¬ 
ary  stochastic  time  series  has  been  presented,  which  overcomes  the  limitation  of 
restricted  reduction  to  the  noise-free  case.  The  minimal  architecture  required  for 
the  network  to  be  a  universal  approximator  and  to  contain  the  non-restricted  noise- 
free  case  was  found  to  be  of  a  two-hidden-layer  structure.  We  have  shown  that  the 
other  recently  developed  approaches  to  the  same  problem  are  of  a  very  similar  form, 
differing  basically  only  with  respect  to  the  x-dependence  of  the  parameters  and  the 
functional  form  of  the  “RBF” -kernels.  Which  architecture  one  should  finally  adopt 
depends  on  the  specific  problem  under  consideration.  While  the  structure  proposed 
by Weigend and Nix is suitable for a computationally cheap, but biased approximation
of $p(x(t+1) \mid x(t))$ (approximation by a single Gaussian), the CDEN, the
method of soft histograms and the network proposed here offer a more accurate,
but also computationally more expensive determination of the whole conditional
probability density. It is an important subject of further research to assess the
relative merits and drawbacks of the latter three models by carrying out comparative
simulations on a set of benchmark problems.
The additive neural network model is a nonlinear dynamical system and it is well-known that if
either the weight matrices are symmetric or the dynamical system is cooperative and irreducible
(with isolated equilibria) then the net exhibits convergent activation dynamics. In this paper we
present a convergence theorem for additive neural nets with ramp sigmoid activation functions
having a row-dominant weight matrix. Of course, such nets are not, in general, cooperative, nor
do they possess symmetric weight matrices. We also indicate the application of this theorem to a
new class of neural networks - the Cellular Neural Networks - and consider its usefulness as a
practical result in image processing applications.

1  Introduction 

In the last few years the definition of the neural network dynamical system has
expanded rapidly. This paper concerns neural networks associated with the set of
ODEs:

$$\dot{x}_i = -x_i + \sum_{j=1}^{n} b_{ij} f(x_j) + I_i, \qquad (1)$$

where $x \in \mathbb{R}^n$ describes the activity levels of n ‘neurons’, $b_{ij}$ is a real constant
representing the connection strength from neuron j to neuron i, $f$ is a sigmoid
function and $I_i$ represents a clamped input. In order to facilitate the analysis of (1)
we define $f$ as a ramp sigmoid:

$$f(x) = \tfrac{1}{2}\left(|x + 1| - |x - 1|\right). \qquad (2)$$
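As a small illustration, the ramp sigmoid can be written either with absolute values or as a clipping operation. The function names `ramp` and `clip` and the sample points below are illustrative choices, not part of the paper.

```python
def ramp(x):
    """Ramp sigmoid written with absolute values: (|x + 1| - |x - 1|) / 2."""
    return 0.5 * (abs(x + 1.0) - abs(x - 1.0))


def clip(x):
    """Equivalent piecewise form: -1 for x < -1, x for |x| <= 1, +1 for x > 1."""
    return max(-1.0, min(1.0, x))


points = [-3.0, -1.0, -0.25, 0.0, 0.5, 1.0, 2.5]
values = [ramp(x) for x in points]
```

The clipped form makes the three linear segments explicit: the function is the identity on the middle segment and constant on the two outer ones.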

We  wish  to  investigate  the  activation  dynamics  of  these  nets  operating  as  CAMs. 
Upon  presentation  of  an  input  the  net  should  ‘flow’,  eventually  settling  at  a  station¬ 
ary  point  of  the  dynamical  system  which  represents  output  in  the  form  of  n  reals. 
With  this  dynamic  in  mind  we  wish  to  prevent,  by  judicious  design,  any  of  the  more 
exotic  dynamical  behaviour  exhibited  by  nonlinear  systems.  What  we  seek  to  prove 
is  that  the  union  of  the  basins  of  attraction  of  equilibria  comprises  the  whole  phase 
space.  It  is  natural  then  to  make  the  following  definition: 

Definition 1 A dynamical system is said to be convergent if for every trajectory
we have:

$$\lim_{t \to \infty} \phi_t(x) = \eta,$$

where $\eta$ is an equilibrium point.

It is possible to relax this definition in different ways and still have workable
definitions of the behaviour we require. For more details, see the excellent paper by
Hirsch [1].

It is easy enough to show that solutions $x(t, x_0)$ to (1) have bounded forward
closure if $\|x_0\| < K$, $K > 0$, in the sup norm. We will assume boundedness of




Joy:  Convergence  of  Neural  Nets 

initial conditions throughout, so that the phase space is a compact subset of $\mathbb{R}^n$.
By standard theory, all trajectories can be extended to have domain of definition
equal to the whole of $\mathbb{R}$.

Since $f$ is linear in the three segments defined by $|x| < 1$, $x > 1$ and $x < -1$,
the CNN vector field is piecewise linear. The regions of $\mathbb{R}^n$ on which the vector field
is constant will be called partial or total saturation regions depending on whether
some cells are in their linear region or not. It is clear (by piecewise linearity) that
if any total saturation region contains a stable equilibrium it is a trapping region,
whilst the absence of an equilibrium means that all trajectories will leave such a
region. Finally, Jacobians are constant on the various partial saturation regions.
The analysis of large-scale interconnected systems often proceeds by investigating
the behaviour of the system thought of as a collection of subunits. When we can
describe the dynamical behaviour of the subunits successfully we may make progress
towards the description of the behaviour of the system as a whole by supposing
that the connections between units are so weak that the dynamical behaviour of
the isolated units dominates. In [2] an analysis of a type of additive neural net
was presented which presumed a strict row-diagonal dominance condition on the
feedback matrix $B = (b_{ij})$; namely:

$$b_{ii} - 1 > \sum_{j \ne i} |b_{ij}|. \qquad (3)$$

Roughly  speaking  (3)  says  that  in  terms  of  the  dynamics,  the  self-feedback  of  each 
CNN  cell  dominates  over  the  summed  influences  from  its  neighbours  and  may  there¬ 
fore  be  viewed  as  an  example  of  the  above  large-scale  system  analysis.  In  section 
III  of  [2],  the  authors  conjectured  that  (3)  is  sufficient  to  ensure  convergence  of  the 
associated  net;  furthermore  they  outlined  a  method  of  proof.  In  this  note  we  show 
that  this  conjecture  is  true  by  giving  a  rigorous  argument  built  on  their  ideas. 
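Condition (3) is easy to test for a given feedback matrix. The following minimal sketch checks it row by row; the function name `satisfies_condition_3` and both example matrices are illustrative assumptions.

```python
def satisfies_condition_3(B):
    """True if b_ii - 1 > sum_{j != i} |b_ij| holds for every row i of B."""
    n = len(B)
    return all(
        B[i][i] - 1.0 > sum(abs(B[i][j]) for j in range(n) if j != i)
        for i in range(n)
    )


B_ok = [[3.0, 0.5], [-0.5, 3.0]]     # 3 - 1 = 2 > 0.5 in both rows
B_bad = [[1.5, 0.8], [0.2, 2.0]]     # row 0: 1.5 - 1 = 0.5 is not > 0.8

ok = satisfies_condition_3(B_ok)
bad = satisfies_condition_3(B_bad)
```

Note that the test involves no symmetry requirement: `B_ok` is not symmetric, yet it satisfies the condition.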

2  Convergence  of  the  Dynamical  System. 

Of prime importance to our result is the fact that if $B - I$ satisfies (3), then each
total saturation region contains a (unique) equilibrium point; this is the substance
of:

Lemma 2 If B is strictly row-diagonally dominant, then every total saturation
region contains an equilibrium.

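Proof We borrow some notation from [3]; let $S \subset \{1, \dots, n\}$ be of cardinality $m$
and define the following sets:

$$\Lambda_m = \{\varepsilon = (\varepsilon_1, \dots, \varepsilon_n) \mid \varepsilon_i = 0,\ i \in S \text{ and } \varepsilon_i \in \{-1, 1\} \text{ for } i \notin S\},$$

where:

$$\Lambda = \{(\varepsilon_1, \dots, \varepsilon_n) \mid \varepsilon_i \in \{-1, 1\}\}.$$

Then for each $\varepsilon \in \Lambda$ define:

$$C(\varepsilon) = \{(x_1, \dots, x_n) \in \mathbb{R}^n \mid |x_i| < 1 \text{ if } \varepsilon_i = 0,\ x_i > 1 \text{ if } \varepsilon_i = 1,\ x_i < -1 \text{ otherwise}\}.$$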

Firstly we consider the total saturation region $C(\varepsilon_s)$, for some $\varepsilon_s \in \Lambda_0$. Suppose
that an equilibrium $x$ for the linear system of equations restricted to $C(\varepsilon_s)$ satisfies
$f(x_i) = 1$; then:

$$x_i = b_{ii} + \sum_{j \ne i} b_{ij} f(x_j) \ge b_{ii} - \sum_{j \ne i} |b_{ij}| > 1,$$

provided that (3) is satisfied. Similarly, if $f(x_i) = -1$, then $x_i < -1$. Thus each total
saturation region contains a (unique) equilibrium. □
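The content of the lemma can be verified directly for a small example. In a total saturation region every output is fixed at $\pm 1$, so for the input-free system the only candidate equilibrium is $x = B\varepsilon$. The sketch below enumerates all sign patterns and checks that each candidate lies in its own region; the matrix and the helper names `saturated_equilibria` and `lies_in_region` are illustrative assumptions.

```python
from itertools import product


def saturated_equilibria(B):
    """Map each sign pattern eps in {-1, 1}^n to the candidate equilibrium x = B eps."""
    n = len(B)
    result = {}
    for eps in product((-1.0, 1.0), repeat=n):
        result[eps] = [sum(B[i][j] * eps[j] for j in range(n)) for i in range(n)]
    return result


def lies_in_region(x, eps):
    """True if x_i > 1 where eps_i = 1 and x_i < -1 where eps_i = -1."""
    return all(xi * ei > 1.0 for xi, ei in zip(x, eps))


B = [[3.0, 0.5], [-0.5, 3.0]]        # hypothetical matrix satisfying (3)
equilibria = saturated_equilibria(B)
all_inside = all(lies_in_region(x, eps) for eps, x in equilibria.items())
```

For this matrix the four candidates are $(\pm 3.5, \pm 2.5)$ and $(\pm 2.5, \mp 3.5)$, each inside the region that generated it.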

Much more than Lemma 2 is true; in fact it is possible to show that there exists
an equilibrium in each partial and total saturation region if and only if $B - I$
satisfies (3). In this case it is possible, by a linearly invertible change of variables
(Grossberg’s S$\Sigma$-exchange), to consider the dynamical system operating on a closed
hypercube, a so-called BSB system (“Brain-State-in-a-Box”); the existence of
equilibria at each corner of the hypercube is then shown in [4].

Our  main  theorem  is: 


Theorem 3 If $B - I$ is strictly row-diagonally dominant then the dynamical system
(1) with N cells defined by:

$$\dot{x} = -x + B f(x),$$

is convergent.


Proof By Gerschgorin’s Circle Theorem (see [5]), if (3) is satisfied all the
Gerschgorin discs lie in the right half-plane. Thus any unsaturated variable,
corresponding to a cell operating in its linear region, has an associated eigenvalue with
positive real part. It follows that trajectories leave partial saturation regions
and the linear region since at least one eigenvalue of the Jacobian has positive real
part there (unless, of course, a trajectory lies in the stable manifold of an unstable
equilibrium point, in which case it converges to an equilibrium anyway).

Once a hyperplane $x_i = \pm 1$ has been crossed by a trajectory $\phi(t)$, say at time
$t_0$, then at no future time $t > t_0$ will this trajectory satisfy $|\phi_i(t)| < 1$. To see this
consider the vector field along the hyperplane defined by $x_i = 1$. We have:

$$\dot{x}_i = b_{ii} - 1 + \sum_{j \ne i} b_{ij} f(x_j) \qquad (4)$$

and it follows that $\dot{x}_i > 0$ along this hyperplane, since the maximum absolute value
that the summation term in (4) attains equals:

$$\sum_{j \ne i} |b_{ij}|,$$

because $|f(x)| \le 1$ for all x. (An identical argument holds along $x_i = -1$.) Thus
the vector field points ‘outwards’, i.e. in the direction of increasing $|x_i|$, along the
hyperplanes forming the boundary of the individual linear regions and therefore no
trajectory can cross any of them with $|\phi_i|$ decreasing.

If we define:

$$p(\phi(t, x)) = \#(\text{saturated components of } \phi(t, x)),$$

then the argument in the previous paragraph shows that $p(t)$ is a non-decreasing
function (defined on $\{x\} \times [0, \infty)$), which is bounded above by N. Thus $\lim_{t \to \infty} p(t)$
exists and actually must equal N (if not, given some $T > 0$, there is at least one
unsaturated cell, i.e. there exists $i_k$, say, such that $|\phi_{i_k}(x, t)| < 1$ for all $t > T$
and $1 \le i_k \le N$. However, since the real parts of the corresponding eigenvalues are
positive, there exists a $t_0 > T$ such that $|\phi_{i_k}(x, t_0)| > 1$, which is a contradiction).
Since all total saturation regions contain a stable equilibrium point by Lemma 2,
and the basin of attraction of such an equilibrium contains the total saturation
region, all trajectories converge. □
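The mechanism of the proof can be observed in a simple numerical experiment. The following sketch integrates the input-free system with a forward-Euler scheme and records the number of saturated components after every step; the matrix, initial condition, step size and the name `simulate` are illustrative assumptions, and the discretisation is only an approximation of the continuous-time flow.

```python
def clip(v):
    """Ramp sigmoid in its piecewise form."""
    return max(-1.0, min(1.0, v))


def simulate(B, x0, dt=0.01, steps=2000):
    """Forward-Euler integration of dx/dt = -x + B f(x).

    Returns the final state and the number of saturated components
    (|x_i| >= 1) after every step.
    """
    n = len(B)
    x = list(x0)
    counts = []
    for _ in range(steps):
        fx = [clip(v) for v in x]
        x = [x[i] + dt * (-x[i] + sum(B[i][j] * fx[j] for j in range(n)))
             for i in range(n)]
        counts.append(sum(1 for v in x if abs(v) >= 1.0))
    return x, counts


B = [[3.0, 0.5], [-0.5, 3.0]]             # hypothetical matrix satisfying (3)
x_final, counts = simulate(B, [0.1, -0.2])

# Size of the vector field at the final state; zero at an equilibrium.
residual = max(abs(-x_final[i] + sum(B[i][j] * clip(x_final[j]) for j in range(2)))
               for i in range(2))
```

In this run the saturation count never decreases, both cells end up saturated, and the final state sits at an equilibrium of the form $x = B\varepsilon$ to within numerical tolerance.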
By an identical argument we are able to prove the more general result:

Theorem 4 If an additive neural network described by (1) satisfies the condition:

$$b_{ii} - 1 > \sum_{j=1,\, j \ne i}^{n} |b_{ij}| + |I_i| \qquad (5)$$

then it is convergent.


3  Applications  to  CNNs  and  Final  Remarks 

Recently a new class of neural networks called Cellular Neural Networks (CNNs)
has been introduced (see [6]); such nets are recurrent, continuous-time networks
described by (1), with a particular structure on B, called the feedback matrix (that
arises from the architecture of the net), together with a ramp sigmoid activation
function. Therefore Theorems 3 and 4 from the previous section apply to such
nets.

The concept of a CNN is rooted in 2D nonlinear filter theory and cellular
automata. The neurons are arrayed on a two-dimensional grid and any fixed neuron
can be thought of as a centre neuron which is connected, through a neighbourhood
of size $r \ge 1$, to its nearest neighbours, i.e. any cell which can be arrived at in a
king’s walk of length $\le r$ on a chessboard. Neighbourhoods can increase in size
up to a fully interconnected net, but usually we insist on $r = 1$ so that only the
‘nearest’ neighbours are connected. The intention of this architecture is to facilitate
the fabrication of such a device onto a chip, for reducing connections increases the
chances of a successful implementation.
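The king's-walk neighbourhood corresponds to the set of cells within Chebyshev distance r of the centre cell. A minimal sketch follows; the function name `neighbourhood`, the grid size and the convention of excluding the centre cell are illustrative assumptions.

```python
def neighbourhood(row, col, r, n_rows, n_cols):
    """Cells reachable from (row, col) in a king's walk of at most r moves.

    This is the set of cells at Chebyshev distance <= r, clipped to the
    grid; the centre cell itself is excluded.
    """
    cells = []
    for i in range(max(0, row - r), min(n_rows, row + r + 1)):
        for j in range(max(0, col - r), min(n_cols, col + r + 1)):
            if (i, j) != (row, col):
                cells.append((i, j))
    return cells


interior = neighbourhood(2, 2, 1, 5, 5)   # 8 nearest neighbours
corner = neighbourhood(0, 0, 1, 5, 5)     # only 3 neighbours at a corner
wide = neighbourhood(2, 2, 2, 5, 5)       # 24 neighbours for r = 2
```

With r = 1 each interior cell has only eight connections regardless of the grid size, which is the sparsity that makes a hardware implementation attractive.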

CNNs  have  found  a  variety  of  interesting  and  useful  applications  and  most  cases 
share  the  common  idea  that  solutions  to  processing  tasks  are  represented  as  constant 
attracting  trajectories  of  the  associated  dynamical  system.  Upon  presentation  of  an 
input  image,  coded  as  an  array  of  grey-scale  pixels,  the  CNN  should  operate  as  a 
CAM,  ‘flowing’  to  an  output  image. 

The  strict  row-diagonal  dominance  of  the  feedback  matrix  is  a  severe  restriction  on 
the  parameters  because  in  practical  terms  such  a  CNN  would  not  process  bipolar 
images.  A  trajectory  with  a  ‘bipolar’  initial  condition  would  of  course  lie  in  a  total 
saturation  region.  If  (3)  holds  all  such  regions  are  contained  in  the  basin  of  attraction 
of a stable equilibrium point x. The output of the CNN, $f(x)$, would then be
identical  to  the  input  image  and  we  would  observe  no  transformation  of  such  an 
image.  As  proved  in  [2],  (3)  is  a  necessary  and  sufficient  condition  for  the  existence 
of  a  stable  equilibrium  in  each  total  saturation  region.  Conditions  weaker  than  (3) 
would  therefore  allow  bipolar  images  to  undergo  change  because  any  trajectory 
with  initial  conditions  in  a  total  saturation  region  which  contains  no  equilibrium 
must  leave  this  region  and  hence  undergo  a  nonlinear  transformation.  In  practice 
therefore  we  seek  weaker  conditions  which  ensure  complete  stability. 
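The practical consequence for bipolar images can be seen in a small experiment. The sketch below uses forward-Euler integration of the input-free system; both matrices, the image and the name `run` are illustrative assumptions. With a matrix satisfying (3) the bipolar image is returned unchanged, whereas a matrix violating (3) lets the trajectory leave its total saturation region.

```python
def clip(v):
    """Ramp sigmoid in its piecewise form."""
    return max(-1.0, min(1.0, v))


def run(B, x0, dt=0.01, steps=3000):
    """Forward-Euler integration of dx/dt = -x + B f(x); returns the output f(x)."""
    n = len(B)
    x = list(x0)
    for _ in range(steps):
        fx = [clip(v) for v in x]
        x = [x[i] + dt * (-x[i] + sum(B[i][j] * fx[j] for j in range(n)))
             for i in range(n)]
    return [clip(v) for v in x]


image = [1.0, -1.0, -1.0, 1.0]            # a 2 x 2 bipolar image, flattened

# Hypothetical feedback matrix satisfying (3): 4 - 1 = 3 > 3 * 0.5.
B_dominant = [[4.0 if i == j else (-0.5 if (i + j) % 2 else 0.5) for j in range(4)]
              for i in range(4)]
output_dominant = run(B_dominant, image)

# Hypothetical matrix violating (3): 1.2 - 1 = 0.2 is not > 2.
B_weak = [[1.2, -2.0], [-2.0, 1.2]]
output_weak = run(B_weak, [1.0, 1.0])
```

In the second case the candidate equilibrium $B(1, 1)^T = (-0.8, -0.8)$ lies outside the region containing the initial condition, so the state is driven out of saturation and the output differs from the input.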

As  a  final  remark,  let  us  return  to  the  more  general  case  considered  in  the  first 
sections  of  this  paper.  We  have  a  compact  phase  space,  partitioned  by  the  PWL 
vector  field.  If  the  matrix  B  has  eigenvalues  with  positive  real  parts  one  would  like 
to  be  able  to  say  that  trajectories  are  ‘expanding’  and  thus  reach  total  saturation 
regions.  However,  there  are  examples  of  CNNs  with  such  eigenvalues  possessing 
closed  orbits  -  all  that  one  can  say  is  that  no  trajectory  can  remain  in  a  partial 
saturation region if such an eigenvalue exists. The diagonal dominance condition
ensures  that  recrossing  of  a  partial  saturation  boundary  cannot  occur  and  as  we 
have  seen  this  has  a  profound  effect  on  the  global  behaviour  of  trajectories. 
APPLICATIONS OF THE COMPARTMENTAL MODEL
NEURON TO TIME SERIES ANALYSIS

In this paper we discuss an extended model neuron for artificial neural networks (ANNs). It is
the compartmental model, which has already been developed to model living neurons. This type of
neuron belongs to the class of temporal processing neurons. Learning in these neurons can be
achieved by extending the model of Temporal Backpropagation. The basic assumption behind the
model is that the single neuron is considered as possessing finite spatial dimension and not
being a point processor (integrate and fire). Simulations and numerical results are presented and
discussed for three applications to time series problems.

1  Introduction 

Most models of a neuron used in ANNs neglect the spatial structure of a neuron’s
extensive dendritic tree system. However, from a computational viewpoint there are
a number of reasons for taking into account such a structure. (i) The passive
membrane properties of a neuron’s dendritic tree lead to a spread of activity through
the tree from the point of stimulation at a synapse. Hence spatial structure
influences the temporal processing of synaptic inputs. (ii) The spatial relationship of
synapses relative to each other and the soma is important. (iii) The geometry of the
dendritic tree ensures that different branches can function almost independently of
one another. Moreover, there is growing evidence that quite complex computations
are being performed within the dendrites prior to subsequent processing at the
soma.

The (linear) cable theory models try to explain the creation of the action potential
[1]. These models belong to the class of Hodgkin-Huxley derived models. The
one-dimensional cable theory describes current flow in a continuous passive dendritic
tree using P.D.E.s. These equations have straightforward solutions for an idealised
class of dendritic trees that are equivalent to unbranched cylinders. For cases of
passive dendritic trees with a general branching structure the solutions are more
complicated. When the membrane properties are voltage dependent, the analytical
approach using linear cable theory is no longer valid. One way to account
for the geometry of complex neurons has been explored by Abbott et al. [2]. Using
path-integral techniques, they construct the membrane potential response function
(Green’s function) of a dendritic tree described by a linear cable equation. The
response function determines the membrane potential arising from the instantaneous
injection of a unit current impulse at a given point on the tree.

A  complementary  approach  to  modelling  a  neuron’s  dendritic  tree  is  to  use  a  com¬ 
partmental  model  in  which  the  dendritic  system  is  divided  into  sufficiently  small 
regions  or  compartments  such  that  spatial  variations  of  the  electrical  properties 
within  a  region  are  negligible.  The  P.D.E.s  of  cable  theory  then  simplify  to  a  sys¬ 
tem  of  first  order  O.D.E.s.  From  these  equations  we  can  then  calculate  the  response 
function,  i.e.  the  law  with  which  the  action  potential  is  created,  see  [3],  [4],  [5].  Using 
the  previous  ideas,  we  construct  an  artificial  neuron  through  appropriate  simplifica¬ 
tions  in  the  structure  of  the  neurobiological  compartmental  model  and  its  response 
function. 
2 The Compartmental Model Neuron
2.1 The Transfer Function


The artificial neuron is composed of a set of compartments, at which connections
from other neurons of the network arrive. The transfer function for neuron i
in layer l at time instance t has the form:

$$s(V_{il}^{0}(t)) = s\Big( \sum_{n=0}^{T} \sum_{\beta=0}^{m} f_{|\beta|}(n)\, u^{\beta}(n) \Big) \qquad (1)$$

$$u^{\beta}(n) = \sum_{j=1}^{M_{\beta}} \dots \qquad (2)$$

where s is any suitable nonlinear function, usually the sigmoid, $V_{il}^{0}(t)$ is the net
potential at the “soma” (compartment 0) at time t, T is the number of time steps used
to keep a history of the neuron, m is the number of compartments in the neuron,
$M_{\beta}$ is the number of incoming weights to compartment $\beta$ (note that this number
is compartment related), $u^{\beta}(n)$ is the net potential at compartment $\beta$ at time
instance n and $f_{|\beta|}(n)$ is a kernel function corresponding to the response function
of the respective neurobiological model and has the following general form:


$$f_{|\alpha-\beta|}(t) = G(\alpha, t; \beta, 0) = e^{\dots}\, I_{|\alpha-\beta|}(\dots) \qquad (3)$$

The parameters $\gamma$ and $\tau$ determine effectively the delay rate in the propagation of
the signal from compartment $\beta$ at time 0 (when a stimulus arrives there) to
compartment $\alpha$ at time t. The function $I_n(t)$ is the modified Bessel function of integer
order n.

For the purpose of the artificial neuron the parameters can either remain constant
or adapt during the simulation. The connections that exist between neurons are
assumed to be like the ones of the Temporal Backpropagation case, i.e. having
tapped delays up to a desirable order N, see [6]. In this way the neuron produces
a signal that depends also on the previous values of its input.
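A connection with tapped delays acts as a finite impulse response filter on its input. The following is a minimal sketch of such a connection, not the implementation used in the simulations: the class name `TappedDelaySynapse`, the three tap weights and the zero initial history are illustrative assumptions.

```python
from collections import deque


class TappedDelaySynapse:
    """FIR connection: output(t) = sum_k w[k] * x(t - k), with x = 0 before t = 0."""

    def __init__(self, weights):
        self.weights = list(weights)
        # Fixed-length buffer; the oldest sample drops out automatically.
        self.history = deque([0.0] * len(self.weights), maxlen=len(self.weights))

    def step(self, x):
        self.history.appendleft(x)        # history[k] now holds x(t - k)
        return sum(w * h for w, h in zip(self.weights, self.history))


synapse = TappedDelaySynapse([0.5, 0.25, 0.125])   # hypothetical tap weights
outputs = [synapse.step(x) for x in [1.0, 0.0, 0.0, 2.0]]
```

A unit impulse at the first step reproduces the tap weights one after another at the output, which shows how a single past input continues to influence the signal for as many steps as there are taps.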

2.2  The  Learning  Law 

In  applying  the  compartmental  model  to  problems  of  supervised  learning  we  devise 
a  learning  law  that  is  based  on  Temporal  Backpropagation,  but  is  modified  appro¬ 
priately  to  account  for  the  existence  of  distinct  compartments.  Assuming  an  MSE 
error  function  and  a  sigmoid  transfer  function  of  the  type: 

then  the  learning  law  (stochastic  case)  has  the  following  form: 

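$$\dots = \dots + \dots \qquad (5)$$

$$\dots = \dots\,(t - s)\,\delta_{il}(t) + \dots \qquad (6)$$

$$\delta_{il}(t) = s'(t)\,(-2)\,e_{il}(t) \quad \text{for } l = L \qquad (7)$$

$$\delta_{il}(t) = \dots \sum_{p=1}^{\dots} \sum_{\dots} \dots \qquad (8)$$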
In the above equations, $\eta$ and $\alpha$ are the learning rate and momentum coefficients,
$s$ is a temporal index and $T_{il}$ is the temporal dimension of neuron i at layer
l. In equation (6) t refers to the current time and s is then the time delay from the
current time. L is the output layer, $e_{il}(t)$ is the error in the output layer, $s'$ is the
derivative of the transfer function, and the upper limit of the sum over p in (8) is the
number of outgoing connections from neuron i. The $w_{ilj}$ are the weights connecting
neuron j at layer $(l - 1)$ to neuron i at layer l. In the notation we take into account
that the weights are associated with different compartments. The index x in the weights
denotes the compartment in which an incoming connection terminates and the index d is
the compartment (of neuron p at layer $(l + 1)$) to which an outgoing connection of i is
connected. Finally the $\delta_{il}$ are defined as follows:


$$\delta_{il}(t) = \frac{\partial E}{\partial V_{il}(t)} \qquad (9)$$

where E is the error function and $V_{il}(t)$ is the activation of the neuron before the
transfer function.
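The output-layer expression (7) follows from definition (9) by the chain rule, and this can be checked against a finite-difference derivative. The sketch below assumes, for illustration only, a single output neuron with squared error $E = (d - s(V))^2$, error $e = d - s(V)$ and a logistic sigmoid; the names `sigmoid`, `error` and `delta_output` and the numerical values are hypothetical.

```python
import math


def sigmoid(v):
    """Logistic sigmoid, assumed here as the transfer function s."""
    return 1.0 / (1.0 + math.exp(-v))


def error(v, target):
    """Squared error E = (target - s(V))^2 of a single output neuron."""
    return (target - sigmoid(v)) ** 2


def delta_output(v, target):
    """delta = dE/dV = s'(V) * (-2) * e, with e = target - s(V)."""
    s = sigmoid(v)
    return s * (1.0 - s) * (-2.0) * (target - s)


v, target, h = 0.3, 0.9, 1e-6
numeric = (error(v + h, target) - error(v - h, target)) / (2.0 * h)
analytic = delta_output(v, target)
```

Under these assumptions the two values agree closely, which confirms that the factor $-2$ in (7) is the derivative of the squared error with respect to the neuron output.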


3  Simulation  Results 

We tested the model using three time series: the logistic map series near its chaotic
region, an astronomical series which describes the interaction of a gravitational
wave with a distribution of particles in the interstellar medium [7], and a solid state
physics series which describes the chaotic voltage oscillations in the NDR region
of the I-U characteristics registered in $\mathrm{V_2O_5}$ single crystals [8]. In all cases the
parameters $\gamma$ and $\tau$ were kept constant.

3.1  Logistic  Map 

In Fig. 1 we see the performance of the model in comparison with a Temporal
Backpropagation network with the same number of degrees of freedom. The network
employed for the compartmental model was a 1-15-1 architecture with 3 tapped delay
lines for the hidden neurons and 1 line for the output unit. In both networks
we used 400 points for training, 200 points for validation and 250 points for testing
the mapping that was achieved. In the testing set, after the first 200 points we
used multi-step iterative prediction for the remaining 50 points. The parameters
$\gamma$ and $\tau$ had values 1.0 and 2.0 respectively. The incoming connections from the
hidden neurons to the output neuron were connected to compartment number zero
(“soma”).

The parameter of the logistic map was $\lambda = 3.8$ and the initial condition was
$x_0 = 0.1$. In Fig. 1 we see the region of multi-step prediction, ranging from number
900 up to number 932 for both methods. The solid line is the logistic map series, the
sparse dashed line is the Compartmental model prediction and the dense dashed
line is the Temporal Backpropagation model.

We  see  initially  that  after  two  points  with  correct  match  the  Compartmental  model 
does  not  predict  correctly  the  following  twelve  points.  The  same  behaviour  is  ob¬ 
served  for  the  Temporal  Backpropagation  model.  But  afterwards,  strangely  enough, 
the  model  closely  follows  the  underlying  mapping  for  the  remaining  points.  This 
behaviour  extends  further  in  the  series. 
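The experimental protocol can be sketched as follows: generate the series, split it, and compare single-step prediction with iterated multi-step prediction. This is an illustration only. The placeholder `model` is the true map with a slightly perturbed parameter, standing in for a trained network, and the names `logistic_series` and `multi_step` are assumptions.

```python
def logistic_series(n, lam=3.8, x0=0.1):
    """Return x_0, ..., x_{n-1} of the logistic map x_{t+1} = lam * x_t * (1 - x_t)."""
    xs = [x0]
    for _ in range(n - 1):
        xs.append(lam * xs[-1] * (1.0 - xs[-1]))
    return xs


def multi_step(predict_one, seed, horizon):
    """Iterated prediction: each prediction is fed back as the next input."""
    preds, x = [], seed
    for _ in range(horizon):
        x = predict_one(x)
        preds.append(x)
    return preds


series = logistic_series(850)
train, valid, test = series[:400], series[400:600], series[600:850]


def model(x):
    """Hypothetical imperfect one-step predictor (0.1% parameter error)."""
    return 3.8038 * x * (1.0 - x)


# Single-step prediction on the first 200 test points uses the true inputs.
one_step_error = max(abs(model(a) - b) for a, b in zip(test[:199], test[1:200]))

# Multi-step prediction for the remaining 50 points uses its own outputs.
preds = multi_step(model, test[199], 50)
multi_step_error = max(abs(p - t) for p, t in zip(preds, test[200:250]))
```

Since the map is chaotic at this parameter value, small one-step errors are typically amplified when predictions are fed back, so a model with excellent single-step accuracy may still lose track of the series within a few iterated steps.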


3.2  Astronomical  Series 

In  Fig.  2  we  see  a  series  that  is  describing  the  interaction  of  a  gravitational  wave 
with  the  interstellar  plasma.  The  interaction  that  is  shown  is  chaotic. 
Figure 1 Logistic Time Series. Logistic map, $\lambda$ = 3.8. T = 1, G = 2. Network 1-15-1.
TDimH = 3, TDimO = 1. Multistep prediction region 900-932.

Figure 2 Astrophysical Time Series. T = 1, G = 1. Network 1-5-1. TDimH = 3,
TDimO = 1. Connections to “soma”.


Again the Compartmental and Temporal Backpropagation models were used for
comparison. The architecture that was employed was a 1-5-1 network with 3 tapped
delays for the hidden neurons and 1 delay line for the output neuron. The $\tau$ and
$\gamma$ parameters were 1.0 and 1.0 respectively. The incoming connections to the output
neuron were connected to the “soma”. Again 400 points were used, here though
from the range 400-800, because we wanted to avoid the initial transient period.
The prediction that was sought was in the range 800-1000; 150 points were
used for single-step prediction and the remaining 50 points were generated by a
multi-step prediction scheme.
The solid line corresponds to the original series, the sparse and dense dashed lines
represent the Compartmental and Temporal Backpropagation model predictions
respectively.

Fig.  2  shows  that  a  good  approximation  of  the  underlying  map  was  achieved  by 
both  models  even  though  a  fairly  simple  architecture  was  employed  and  the  training 
set  was  not  the  most  representative. 

3.3  Solid  State  Series 

In  Fig.  3  we  see  a  series  that  is  describing  the  chaotic  voltage  oscillations  in  the 
NDR  region  of  the  crystal. 


Figure 3 Solid State Physics Time Series. T = 1, G = 2. Network 1-5-1, TDimH = 8,
TDimO = 2, connections to “soma”. Network 1-5-1, TDimH = TDimO = 1,
5 comps/neuron, fully connected.

Again the Compartmental and Temporal Backpropagation models were used for
comparison. The architecture that was employed was a 1-5-1 network with 8 tapped
delays for the hidden neurons and 2 delay lines for the output neuron. The $\tau$ and
$\gamma$ parameters were 1.0 and 2.0 respectively. The incoming connections to the output
neuron were connected to the “soma”. For training, 450 points were used, from
the range 1-450; 200 points were used for validation, in the range 500-700. The
prediction that was sought was in the range 700-1000; 200 points were
used for single-step prediction and the remaining 100 points were generated by a
multi-step prediction scheme.

Here we also try to tackle the problem of how the incoming weights to a neuron
should be distributed among its compartments. Instead of connecting to just
one compartment, we tested the idea of connecting to all the compartments. For this
purpose we used a 1-5-1 network structure, with 1 tapped delay line per hidden and
output neuron. The neurons now consist of 5 compartments each, and the incoming
signal from every other neuron is connected to each of them. The key point to note here
is that the value of the incoming signal is the same for all five compartments,
but different weights exist for every connection to a specific compartment.
In Fig. 3 the solid line corresponds to the original series, the flat sparse and
exponentially decaying dense dashed lines represent the Compartmental and Temporal
Backpropagation model predictions respectively. The dense dashed line is the
Compartmental model with full connectivity to all compartments.

Fig.  3  shows  that  a  good  approximation  of  the  underlying  map  was  achieved  by  both 
models  even  though  a  fairly  simple  architecture  was  employed.  We  were  surprised 
by  the  fact  that  the  best  approximation  to  the  underlying  map  was  achieved  by  the 
fully  connected  model. 

Finally, we note that, for the sake of comparison, the parameters of the nets were
chosen to give exactly the same number of free parameters (weights), namely
fifty, for all three models.
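The figure of fifty weights can be checked with a little arithmetic. The sketch below is only an illustration under stated assumptions: it assumes one weight per tap (or per compartment) on every connection and that bias terms are not counted, and the function names are hypothetical rather than taken from the paper.

```python
def tapped_delay_weights(n_in, n_hidden, n_out, taps_hidden, taps_out):
    """Weight count for an n_in-n_hidden-n_out net with one weight per tap.

    Assumption: biases are not counted as free parameters.
    """
    return n_in * n_hidden * taps_hidden + n_hidden * n_out * taps_out


def fully_connected_compartment_weights(n_in, n_hidden, n_out, compartments, taps=1):
    """Weight count when every connection feeds every compartment of its target.

    Assumption: one weight per (source neuron, target compartment, tap).
    """
    return (n_in * n_hidden + n_hidden * n_out) * compartments * taps


# 1-5-1 net, 8 taps on hidden neurons, 2 taps on the output neuron.
delay_net = tapped_delay_weights(1, 5, 1, taps_hidden=8, taps_out=2)

# 1-5-1 net, 1 tap everywhere, 5 compartments per neuron, fully connected.
compartment_net = fully_connected_compartment_weights(1, 5, 1, compartments=5)

print(delay_net, compartment_net)
```

Under these assumptions both configurations give the same count, which matches the number quoted in the text.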

4  Conclusion 

From these initial simulations we see that in general the Compartmental model is
at least as successful as the Temporal Backpropagation model for the time series
involved. Even though fairly simple architectures were used, the underlying mapping
was approximated reasonably well, as the single-step predictions show. For
the multi-step predictions further simulations are needed in order to specify a more
appropriate architecture and parameter range that lead to better performance.
Further research is now being carried out to investigate the complex couplings of
the parameters that control the behaviour of the model. A further major issue for the
model is the scheme by which the incoming connections of each neuron are assigned
to its compartments [11].
The purpose of this article is to describe some new applications of information-theoretic concepts
in unsupervised learning with particular emphasis on the implementation of contextual guidance
during learning and processing.

1  Introduction 

Building on earlier work by Linsker, and Becker and Hinton [8,2], Kay and Phillips
[7] used the concept of three-way mutual information to define a new class
of objective functions. These are designed to maximise the transfer of that part of
the information shared between a set of inputs (the receptive field) and the outputs
that is predictably related to the context (the contextual field) in which the learning
occurs. They termed one member of this new class of objective functions Coherent
Infomax. In addition they introduced a new activation function which combines
information flowing from the receptive and contextual fields in a novel way within
local processors. Two illustrations of the role of contextual guidance are given in
[7] and further demonstrations are provided in [9]. This work, however, considered
only the case where the local processors have binary output units. In this article the
methodology proposed in [7] is extended to the case of general multivariate Gibbs
distributions but will be described, for simplicity, in the particular case of
multivariate binary outputs. The article proceeds as follows. In section 2 notation
will be described and probabilistic modelling of the multivariate outputs considered.
The definition of suitable objective functions will be discussed in section 3 and
various local objective functions introduced. The learning rules will be presented
in section 4 and finally in section 5 computational issues will be briefly considered.

2  Probabilistic  Modelling 

We consider a local processor having multiple outputs. This processor receives input
from two distinct sources, namely, its receptive field (RF) inputs, $\mathbf{R} = \{R_1, R_2, \ldots\}$,
and its contextual field (CF) inputs, $\mathbf{C} = \{C_1, C_2, \ldots\}$, and produces its outputs,
$\mathbf{X} = \{X_1, X_2, \ldots\}$, where $\mathbf{R}$, $\mathbf{C}$ and $\mathbf{X}$ are taken to be random vectors and we adopt
the usual device of denoting a random variable by a capital letter and its observed
value by the corresponding lower-case letter. In order to allow explicitly for the
possibility of incomplete connectivity we define connection neighbourhoods for the
ith output unit $X_i$. Let $\partial i(r)$, $\partial i(c)$ and $\partial i(x)$ denote, respectively, the set of indices
of the RF input units, the set of indices of the CF inputs and the set of indices of the
outputs that are connected to the ith output unit $X_i$. The corresponding random
variables are denoted by $\mathbf{R}_{\partial i}$, $\mathbf{C}_{\partial i}$, $\mathbf{X}_{\partial i}$ respectively and the set of all components
of $\mathbf{X}$ excluding the ith component is $\mathbf{X}_{-i}$. The weights on the connections into the
ith output unit are given by $w_{ij}$, $v_{ij}$ and $u_{ij}$ for the jth RF input, jth CF input
and the jth output unit respectively and we assume that the weights connecting
the output units to each other are symmetric. We now define the integrated fields
in relation to the ith output unit.

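$s_i(r) = \sum_{j \in \partial i(r)} w_{ij} R_j - w_{i0}$ is the RF integrated field.

$s_i(c) = \sum_{j \in \partial i(c)} v_{ij} C_j - v_{i0}$ is the CF integrated field.

$s_i(x) = \sum_{j \in \partial i(x)} u_{ij} X_j$ is the output integrated field.

The weights $w_{i0}$ and $v_{i0}$ are biases.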

The  activation  function  at  the  ith  output  unit  is  now  a  function  of  three  integrated 
fields  and  we  shall  take  it  to  have  the  following  form 

$$A_i = A(s_i(r), s_i(c)) + s_i(x) = a_i + s_i(x) \tag{1}$$

although  we  shall  derive  the  learning  rules  in  the  general  case.  This  particular  way 
of  incorporating  the  integrated  output  has  been  chosen  so  that  this  local  activation 
function  at  the  ith  output  unit  is  consistent  with  the  definition  of  a  multivariate 
model  for  X.  The  activation  function  A  which  binds  the  activity  of  the  RF  and  CF 
integrated  fields  is  that  proposed  in  [7]  defined  by 

$$A(s_1, s_2) = \tfrac{1}{2}\, s_1 \left(1 + \exp(2 s_1 s_2)\right) \tag{2}$$

We assume that the output vector $\mathbf{X}$ follows a binary Markov random field model
[3] conditional on the RF and CF inputs, with probability mass function

$$\Pr(\mathbf{X} = \mathbf{x} \mid \mathbf{R} = \mathbf{r}, \mathbf{C} = \mathbf{c}) = \frac{\exp\left(\sum_{i} a_i x_i + \tfrac{1}{2} \sum_{i} \sum_{j \in \partial i(x)} u_{ij} x_i x_j\right)}{Z(\mathbf{a}, \mathbf{u})} \tag{3}$$

where $Z(\mathbf{a}, \mathbf{u})$ is the normalisation constant (i.e. not a function of $\mathbf{x}$) required to
ensure that the probabilities sum to unity. This model is a regression model in two
distinct senses. Firstly, via the terms $\{a_i\}$, which are general nonlinear functions
of the RF and CF inputs, it is a nonlinear regression of the outputs with respect
to all of the inputs. Secondly, when written in conditional form (see equation (4)),
it expresses an auto-regression for each output unit in terms of the other output
units in its neighbourhood. The formulation developed in the above model has the
advantage of interfacing a feed-forward network between layers with a recurrent
network structure within the output layer, all within a single coherent probabilistic
framework. Moreover, it is also possible to connect the multiple output local
processors themselves in a multi-layered network structure in a probabilistically
coherent manner.


From this model the local conditional distributions may be derived. As we shall
see, using these distributions provides a local structure to the learning rules and,
under the restrictions on the output connection weights, the Hammersley-Clifford
theorem [3] ensures that working locally with the conditional models is equivalent
to assuming a coherent global model for the outputs. However, in the case where
the output units are fully, mutually connected, the equations presented here hold
without any necessity to invoke this theorem to ensure probabilistic coherence and
are derived using the basic rules of probability.

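The local conditional distribution for the ith output is

$$\theta_i \equiv \Pr(X_i = 1 \mid \mathbf{R}_{\partial i} = \mathbf{r}_{\partial i}, \mathbf{C}_{\partial i} = \mathbf{c}_{\partial i}, \mathbf{X}_{\partial i} = \mathbf{x}_{\partial i}) = \frac{1}{1 + \exp(-A_i)}. \tag{4}$$

Here $A_i = a_i + s_i(x)$, where $a_i$ is any general differentiable function of the integrated
RF and CF fields.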
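As a minimal numerical sketch of the forward computation at one output unit, the code below evaluates the activation function of [7], read here as A(s1, s2) = (1/2) s1 (1 + exp(2 s1 s2)), adds the output integrated field, and passes the result through the logistic function. The function names are illustrative assumptions, and the integrated fields are supplied directly as numbers.

```python
import math


def activation(s_r, s_c):
    """Binding of the RF and CF integrated fields: 0.5 * s_r * (1 + exp(2 * s_r * s_c))."""
    return 0.5 * s_r * (1.0 + math.exp(2.0 * s_r * s_c))


def output_probability(s_r, s_c, s_x):
    """Conditional probability that the unit is on, given its three integrated fields."""
    a_i = activation(s_r, s_c)           # contribution of the RF and CF fields
    total = a_i + s_x                    # add the output integrated field
    return 1.0 / (1.0 + math.exp(-total))


# The RF field sets the sign; the CF field only amplifies or attenuates it.
print(activation(1.0, 1.0), activation(1.0, 0.0), activation(1.0, -1.0))
print(output_probability(0.0, 3.0, 0.0))
```

Note that a zero RF field gives a zero activation whatever the context, so the context alone cannot drive the unit; it can only modulate the effect of the receptive field.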
3 Global and Local Objective Functions

We  now  consider  a  global  objective  function  based  on  the  joint  distribution  of  all 
outputs,  RF  inputs  and  CF  inputs.  In  the  case  of  multivariate  outputs,  we  consider 
the  general  version  of  the  objective  function  introduced  in  [7]  which  is 

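$$F = I(\mathbf{X}; \mathbf{R}; \mathbf{C}) + \phi_1 I(\mathbf{X}; \mathbf{R} \mid \mathbf{C}) + \phi_2 I(\mathbf{X}; \mathbf{C} \mid \mathbf{R}) + \phi_3 H(\mathbf{X} \mid \mathbf{R}, \mathbf{C}). \tag{5}$$

Here the term $I(\mathbf{X}; \mathbf{R}; \mathbf{C})$ is the three-way mutual information between the random
vectors $\mathbf{X}$, $\mathbf{R}$ and $\mathbf{C}$ given by

$$I(\mathbf{X}; \mathbf{R}; \mathbf{C}) = I(\mathbf{X}; \mathbf{R}) - I(\mathbf{X}; \mathbf{R} \mid \mathbf{C}) \tag{6}$$

and

$$I(\mathbf{X}; \mathbf{R}) = H(\mathbf{X}) - H(\mathbf{X} \mid \mathbf{R}) \tag{7}$$

is the mutual information shared between the random vectors $\mathbf{X}$ and $\mathbf{R}$ and the
symbol $H$ denotes Shannon's entropy. For further details see [4]. The objective
function in equation (5) is based on the three-way mutual information and the
three possible conditional measures of (two-way) mutual information.

For the purposes of modelling this is expressed as

$$F = H(\mathbf{X}) - \psi_1 H(\mathbf{X} \mid \mathbf{R}) - \psi_2 H(\mathbf{X} \mid \mathbf{C}) - \psi_3 H(\mathbf{X} \mid \mathbf{R}, \mathbf{C}). \tag{8}$$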
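The passage from the information form of F to its entropy form follows by expanding each information term into conditional entropies and collecting coefficients. The sketch below checks this numerically on a small random joint distribution. The coefficient relations it uses (psi1 = 1 - phi2, psi2 = 1 - phi1, psi3 = phi1 + phi2 - phi3 - 1) are obtained here by that matching and are an assumption of this sketch; they are not quoted from the text.

```python
import itertools
import math
import random

random.seed(0)

# Random joint pmf p(x, r, c) on a small finite alphabet (purely illustrative).
states = list(itertools.product(range(2), range(3), range(4)))
raw = [random.random() for _ in states]
total = sum(raw)
joint = {state: w / total for state, w in zip(states, raw)}


def entropy(axes):
    """Shannon entropy (nats) of the marginal on the given axes (0=X, 1=R, 2=C)."""
    marginal = {}
    for state, prob in joint.items():
        key = tuple(state[a] for a in axes)
        marginal[key] = marginal.get(key, 0.0) + prob
    return -sum(p * math.log(p) for p in marginal.values() if p > 0)


H_x = entropy((0,))
H_x_given_r = entropy((0, 1)) - entropy((1,))
H_x_given_c = entropy((0, 2)) - entropy((2,))
H_x_given_rc = entropy((0, 1, 2)) - entropy((1, 2))

# Information terms built from the entropies.
I_xr = H_x - H_x_given_r
I_xr_given_c = H_x_given_c - H_x_given_rc
I_xc_given_r = H_x_given_r - H_x_given_rc
I_xrc = I_xr - I_xr_given_c              # three-way mutual information

phi1, phi2, phi3 = 0.3, 0.7, 0.1         # arbitrary test values
F_phi = I_xrc + phi1 * I_xr_given_c + phi2 * I_xc_given_r + phi3 * H_x_given_rc

# Coefficients obtained by matching terms (assumption of this sketch).
psi1, psi2, psi3 = 1 - phi2, 1 - phi1, phi1 + phi2 - phi3 - 1
F_psi = H_x - psi1 * H_x_given_r - psi2 * H_x_given_c - psi3 * H_x_given_rc

print(F_phi, F_psi)
```

The two values agree up to rounding error for any choice of the phi parameters, since the identity is purely algebraic.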

In terms of the output, and indeed the input, units this is a global objective
function and general learning rules have been derived in the general case of Gibbs
distributions. However these lead to learning rules that are global and also
computationally challenging in their exact form and so we now describe a particular local
approximation to the global objective function $F$; other definitions of locality are
possible [5]. In this multiple output case, it is natural to think of the processing
locally in terms of each output unit using the information available to it from its
RF, CF and output neighbourhoods. This suggests that one might focus in turn on
the joint distribution of, say, the ith output and its RF and CF inputs given the
neighbouring output units. From this perspective it then seems natural to introduce
the concept of the conditional three-way mutual information shared by the
ith output and its RF and CF inputs given the neighbouring outputs, denoted by

$$I(X_i; \mathbf{R}_{\partial i}; \mathbf{C}_{\partial i} \mid \mathbf{X}_{\partial i}) = I(X_i; \mathbf{R}_{\partial i} \mid \mathbf{X}_{\partial i}) - I(X_i; \mathbf{R}_{\partial i} \mid \mathbf{C}_{\partial i}, \mathbf{X}_{\partial i}). \tag{9}$$

It is possible to decompose the global three-way mutual information as follows.

$$I(\mathbf{X}; \mathbf{R}; \mathbf{C}) = I(X_i; \mathbf{R}_{\partial i}; \mathbf{C}_{\partial i} \mid \mathbf{X}_{\partial i}) + I(\mathbf{X}_{-i}; \mathbf{R}; \mathbf{C}). \tag{10}$$

This decomposition may be repeated recursively and is of particular relevance when
the output units represent some one-dimensional structure such as a time series;
then the well-known general factorisation of joint probability into a product of a
marginal and conditional distributions allows the general three-way mutual
information to be written as a sum of local conditional three-way mutual information
terms. However such simplicity is not possible here, although this first-step
decomposition shows that the conditional three-way information is a part of the global
three-way information in a well-defined sense. The same conditioning idea may be
applied to the other components of information within the objective function $F$ and
this leads to the specification of a local objective function for the ith output unit
defined as follows

$$F_i = I(X_i; \mathbf{R}_{\partial i}; \mathbf{C}_{\partial i} \mid \mathbf{X}_{\partial i}) + \phi_1 I(X_i; \mathbf{R}_{\partial i} \mid \mathbf{C}_{\partial i}, \mathbf{X}_{\partial i}) + \phi_2 I(X_i; \mathbf{C}_{\partial i} \mid \mathbf{R}_{\partial i}, \mathbf{X}_{\partial i}) + \phi_3 H(X_i \mid \mathbf{R}_{\partial i}, \mathbf{C}_{\partial i}, \mathbf{X}_{\partial i}), \tag{11}$$
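and we express $\{F_i\}$ in the more useful form

$$F_i = H(X_i \mid \mathbf{X}_{\partial i}) - \psi_1 H(X_i \mid \mathbf{R}_{\partial i}, \mathbf{X}_{\partial i}) - \psi_2 H(X_i \mid \mathbf{C}_{\partial i}, \mathbf{X}_{\partial i}) - \psi_3 H(X_i \mid \mathbf{R}_{\partial i}, \mathbf{C}_{\partial i}, \mathbf{X}_{\partial i}). \tag{12}$$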

This means that we envisage each output unit working to maximise $F_i$ and, because
mutually distinct sets of weights connect into each of the
outputs, this is equivalent to maximising the sum of the $F_i$'s. We view this sum as
a locally-based approximation to the global objective function $F$. In the extreme
case where the outputs are conditionally independent this sum is equivalent to $F$.
Obviously the approximation will be better the smaller the output
neighbourhood sets are relative to the number of outputs. We now provide formulae for
the local entropic terms and the components of local information for the ith output
unit.


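$$H(X_i \mid \mathbf{R}_{\partial i}, \mathbf{C}_{\partial i}, \mathbf{X}_{\partial i}) = -\left\langle \theta_i \log \theta_i + (1 - \theta_i) \log(1 - \theta_i) \right\rangle_{\mathbf{R}_{\partial i}, \mathbf{C}_{\partial i}, \mathbf{X}_{\partial i}} \tag{13}$$

$$H(X_i \mid \mathbf{R}_{\partial i}, \mathbf{X}_{\partial i}) = -\left\langle E^{(i)}_{\mathbf{R}_{\partial i}, \mathbf{X}_{\partial i}} \log E^{(i)}_{\mathbf{R}_{\partial i}, \mathbf{X}_{\partial i}} + (1 - E^{(i)}_{\mathbf{R}_{\partial i}, \mathbf{X}_{\partial i}}) \log(1 - E^{(i)}_{\mathbf{R}_{\partial i}, \mathbf{X}_{\partial i}}) \right\rangle_{\mathbf{R}_{\partial i}, \mathbf{X}_{\partial i}} \tag{14}$$

$$H(X_i \mid \mathbf{C}_{\partial i}, \mathbf{X}_{\partial i}) = -\left\langle E^{(i)}_{\mathbf{C}_{\partial i}, \mathbf{X}_{\partial i}} \log E^{(i)}_{\mathbf{C}_{\partial i}, \mathbf{X}_{\partial i}} + (1 - E^{(i)}_{\mathbf{C}_{\partial i}, \mathbf{X}_{\partial i}}) \log(1 - E^{(i)}_{\mathbf{C}_{\partial i}, \mathbf{X}_{\partial i}}) \right\rangle_{\mathbf{C}_{\partial i}, \mathbf{X}_{\partial i}} \tag{15}$$

$$H(X_i \mid \mathbf{X}_{\partial i}) = -\left\langle E^{(i)}_{\mathbf{X}_{\partial i}} \log E^{(i)}_{\mathbf{X}_{\partial i}} + (1 - E^{(i)}_{\mathbf{X}_{\partial i}}) \log(1 - E^{(i)}_{\mathbf{X}_{\partial i}}) \right\rangle_{\mathbf{X}_{\partial i}} \tag{16}$$

It follows that the components of local information at the ith output unit are as
follows.

$I(X_i; \mathbf{R}_{\partial i}; \mathbf{C}_{\partial i} \mid \mathbf{X}_{\partial i}) = (16) - (15) - (14) + (13)$,

$I(X_i; \mathbf{R}_{\partial i} \mid \mathbf{C}_{\partial i}, \mathbf{X}_{\partial i}) = (15) - (13)$,

$I(X_i; \mathbf{C}_{\partial i} \mid \mathbf{R}_{\partial i}, \mathbf{X}_{\partial i}) = (14) - (13)$,

$H(X_i \mid \mathbf{R}_{\partial i}, \mathbf{C}_{\partial i}, \mathbf{X}_{\partial i}) = (13)$.

The $\{E\}$ terms are averages of the output probabilities at the ith unit and are
defined at the end of section 4.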

4  The  Learning  Rules 

Using the locally-based objective functions developed in section 3 and the
formulation so far, it turns out that the learning rules for all weights have the same
general structure as those introduced in [7].

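We describe the gradient ascent learning rules in relation to the ith output unit.

$$\frac{\partial F_i}{\partial w_{is}} = \left\langle (\psi_3 A_i - \bar{O}_i)\, \theta_i (1 - \theta_i)\, \frac{\partial a_i}{\partial s_i(r)}\, R_s \right\rangle_{\mathbf{R}_{\partial i}, \mathbf{C}_{\partial i}, \mathbf{X}_{\partial i}} \tag{17}$$

for each RF input s which connects into the ith output unit.

$$\frac{\partial F_i}{\partial v_{is}} = \left\langle (\psi_3 A_i - \bar{O}_i)\, \theta_i (1 - \theta_i)\, \frac{\partial a_i}{\partial s_i(c)}\, C_s \right\rangle_{\mathbf{R}_{\partial i}, \mathbf{C}_{\partial i}, \mathbf{X}_{\partial i}} \tag{18}$$

for each CF input s which connects into the ith output.

$$\frac{\partial F_i}{\partial u_{is}} = \left\langle (\psi_3 A_i - \bar{O}_i)\, \theta_i (1 - \theta_i)\, X_s \right\rangle_{\mathbf{R}_{\partial i}, \mathbf{C}_{\partial i}, \mathbf{X}_{\partial i}} \tag{19}$$

for each output unit s which connects into the ith output. Note that these learning
rules are local and this results from the decision to separately maximise the local
objective functions $\{F_i\}$ (or equivalently to maximise the sum of the $\{F_i\}$). The
dynamic average for the ith output unit is

$$\bar{O}_i = \log \frac{E^{(i)}_{\mathbf{X}_{\partial i}}}{1 - E^{(i)}_{\mathbf{X}_{\partial i}}} - \psi_1 \log \frac{E^{(i)}_{\mathbf{R}_{\partial i}, \mathbf{X}_{\partial i}}}{1 - E^{(i)}_{\mathbf{R}_{\partial i}, \mathbf{X}_{\partial i}}} - \psi_2 \log \frac{E^{(i)}_{\mathbf{C}_{\partial i}, \mathbf{X}_{\partial i}}}{1 - E^{(i)}_{\mathbf{C}_{\partial i}, \mathbf{X}_{\partial i}}}. \tag{20}$$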

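Here the dynamic averages are more complicated than in the single output case
and their calculation involves storing the average probability at the ith output unit
for each pattern of the other outputs that connect into the ith output unit, for the
combination of each of the neighbouring output and RF input patterns and for the
combination of each of the neighbouring output and CF input patterns. However,
various approximations are possible [5]. The various averages of the probability
at the ith output unit are defined as follows:

$$E^{(i)}_{\mathbf{X}_{\partial i}} = \langle \theta_i \rangle_{\mathbf{R}_{\partial i}, \mathbf{C}_{\partial i} \mid \mathbf{X}_{\partial i}}, \qquad E^{(i)}_{\mathbf{R}_{\partial i}, \mathbf{X}_{\partial i}} = \langle \theta_i \rangle_{\mathbf{C}_{\partial i} \mid \mathbf{R}_{\partial i}, \mathbf{X}_{\partial i}}, \qquad E^{(i)}_{\mathbf{C}_{\partial i}, \mathbf{X}_{\partial i}} = \langle \theta_i \rangle_{\mathbf{R}_{\partial i} \mid \mathbf{C}_{\partial i}, \mathbf{X}_{\partial i}},$$

and the required partial derivatives may be easily calculated.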
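To make the structure of the learning rules concrete, the sketch below computes the scalar factor shared by all three weight updates for a single presented pattern. It assumes the following reading of the rules: the dynamic average is a combination of log-odds of the three averaged probabilities, weighted by 1, -psi1 and -psi2, and the common factor is (psi3 * A_i - dynamic average) * theta_i * (1 - theta_i). The function and argument names are hypothetical.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def dynamic_average(e_x, e_rx, e_cx, psi1, psi2):
    """Combination of log-odds of the three conditional averages (assumed form)."""
    return logit(e_x) - psi1 * logit(e_rx) - psi2 * logit(e_cx)


def online_update_factor(a_total, e_x, e_rx, e_cx, psi1, psi2, psi3):
    """Factor shared by the RF, CF and output weight updates for one pattern.

    a_total is the full activation A_i; the e_* arguments are the current
    conditional averages of the output probability for this pattern.
    """
    theta = 1.0 / (1.0 + math.exp(-a_total))
    o_bar = dynamic_average(e_x, e_rx, e_cx, psi1, psi2)
    return (psi3 * a_total - o_bar) * theta * (1.0 - theta)


factor = online_update_factor(2.0, 0.5, 0.5, 0.5, psi1=1.0, psi2=1.0, psi3=1.0)
print(factor)
```

In an on-line setting this factor would be multiplied by the relevant input value, and for RF and CF weights also by the partial derivative of a_i with respect to the corresponding integrated field, before being scaled by a learning rate.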


5  Some  Computational  Details 

The computation may be performed using on-line learning as was the case in [7] or,
alternatively, using batch learning. In the case of on-line learning the weight change
rules are applied with the averaging brackets removed and the required conditional
averages of output probabilities may be updated dynamically during learning after
the presentation of each pattern via recursive computation. In particular recursive
computation may be used to avoid the need to explicitly calculate the statistics of
the input data and then employ a two-stage approach to the learning. The
methodology has been applied in a number of computational experiments described in [6]
which demonstrate the feasibility of the approach. It is shown there that the
differences between the Infomax and Coherent Infomax computational goals described
by [7] hold in this more general set up and that the multiple output unit is capable
of representing its inputs by means of population codes. Further experimentation
is currently in progress and this will address the computational feasibility of
scaling up to large multiple output units and evaluate various approximations for the
conditional dynamic averages.
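The recursive updating of conditional averages mentioned above can be sketched as a running mean kept separately for each conditioning pattern. This is only one plausible scheme, shown under the assumption that patterns are discrete and hashable; the class name and interface are illustrative and not taken from [6] or [7].

```python
from collections import defaultdict


class RunningConditionalAverage:
    """Running mean of the output probability, one mean per conditioning pattern."""

    def __init__(self):
        self.count = defaultdict(int)     # presentations seen per pattern
        self.mean = defaultdict(float)    # current average per pattern

    def update(self, pattern, theta):
        """Fold one new probability into the average for this pattern."""
        self.count[pattern] += 1
        # Recursive form of the sample mean: no past values need to be stored.
        self.mean[pattern] += (theta - self.mean[pattern]) / self.count[pattern]
        return self.mean[pattern]


avg = RunningConditionalAverage()
for theta in (0.2, 0.4, 0.6):
    avg.update((1, 0), theta)             # same neighbouring-output pattern
avg.update((0, 0), 0.9)                   # a different pattern
print(dict(avg.mean))
```

The storage grows with the number of distinct patterns encountered, which is the reason approximations become attractive for large output neighbourhoods.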
A minimization estimator minimizes the empirical risk associated with a given sample. Sometimes
one calculates such an estimator based not on the original sample but on a pseudo sample obtained
by adding noise to the original sample points. This may improve the generalization performance of
the estimator. We consider the convergence properties (consistency and asymptotic distribution)
of such an estimation procedure. Subject classification: AMS(MOS) 62F12

1  Introduction 

In backpropagation training one usually tries to minimize the empirical risk

$$\frac{1}{n} \sum_{i=1}^{n} \|Y_i - g(X_i, t)\|^2 = \min!,$$

where $x \mapsto g(x, t)$ denotes the input/output mapping of a multilayer perceptron
whose weights comprise the vector $t$. When the training sequence $\{Z_i\}$, $Z_i = (X_i, Y_i)$,
consists of i.i.d. (independent and identically distributed) random vectors,
it is known that under general conditions the parameter $\tau_n$ minimizing the
empirical risk is strongly consistent for the minimizer set of the population risk
$r(t) = E\|Y_1 - g(X_1, t)\|^2$. Further, if the sequence $\{\tau_n\}$ converges towards an
isolated minimizer $t^*$ of $r$, then under reasonable conditions, $\sqrt{n}(\tau_n - t^*)$ converges in
distribution to a normal distribution with mean zero, see White [8].

Here we consider what happens in the above procedure when instead of the original
data one uses data generated by adding noise to the original sample points. Such a
practice has been suggested by several authors, see [4] for references. The
relationship of this procedure to regularization has been investigated in many recent papers,
see [6, 3, 1, 7]. In the present paper we outline both consistency and asymptotic
distribution results for noisy training. The results are based on the doctoral thesis
of the author [5], where the interested reader can find rigorous proofs and results
which are more general than what can be covered here. The consistency results
obtained in the thesis are much stronger than the previous results of Holmstrom and
Koistinen [4]. Asymptotic distribution results in our setting have not been available
previously.

2  The  Statistical  Setting 

The original sample is part of a sequence $Z_1, Z_2, \ldots$ of random vectors taking
values in $\mathbb{R}^k$. The noisy sample is generated by replacing each original sample point
$Z_1, \ldots, Z_n$ with $s \ge 1$ pseudo sample points

$$Z^{\sharp}_{nij} = Z_i + h_n N_{ij}, \qquad j = 1, \ldots, s. \tag{1}$$

Here the $h_n$'s are positive random variables called the smoothing parameters. We
assume that the noise vectors $N_{ij}$ are i.i.d. and independent of the $Z_i$'s and the
$h_n$'s. The smoothing parameters are allowed to depend on the $\{Z_i\}$-sequence. We
need to let $h_n$ converge to zero to ensure consistency, and therefore the pseudo
sample points are not i.i.d. This prevents us from using standard convergence results
for minimization estimators to analyze the convergence properties of noisy training.
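The construction of the noisy sample can be sketched in a few lines. The code below replaces each original point by s perturbed copies, using a common smoothing parameter. The choice of uniform noise on [-1, 1] in each coordinate is an assumption made only for illustration (it has mean zero and compact support); the text itself leaves the noise law unspecified at this point, and the function name is hypothetical.

```python
import random


def make_pseudo_sample(sample, s, h_n, rng):
    """Replace every point z by s pseudo points z + h_n * noise.

    sample: list of equal-length tuples of floats (the original points)
    s:      number of pseudo points generated per original point
    h_n:    smoothing parameter (noise scale)
    rng:    a random.Random instance, so that runs are reproducible
    """
    pseudo = []
    for z in sample:
        for _ in range(s):
            # Illustrative noise law: independent uniform(-1, 1) coordinates.
            noise = [rng.uniform(-1.0, 1.0) for _ in z]
            pseudo.append([zi + h_n * ni for zi, ni in zip(z, noise)])
    return pseudo


rng = random.Random(0)
sample = [(0.0, 1.0), (2.0, -1.0), (0.5, 0.5)]
pseudo = make_pseudo_sample(sample, s=4, h_n=0.1, rng=rng)
print(len(pseudo), pseudo[0])
```

Because the same value of the smoothing parameter enters every pseudo point and shrinks with the sample size, the pseudo points generated for different sample sizes do not share a common law, which is the obstacle described in the text.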
We next define empirical measures associated with the original and the pseudo
observations. Let $\delta_x$ denote the probability measure with mass one at $x \in \mathbb{R}^k$ and
define the probability measure

$$\mu_n := \frac{1}{n} \sum_{i=1}^{n} \delta_{Z_i},$$

i.e., $\mu_n$ places mass $1/n$ at each of the observations $Z_1, \ldots, Z_n$. We call $\mu_n$ the
empirical measure of the first $n$ observations. Similarly, we define the empirical
measure $\mu^{\sharp}_{ns}$ of the pseudo observations $Z^{\sharp}_{nij}$, $i = 1, \ldots, n$, $j = 1, \ldots, s$, as the
probability measure which places mass $1/(ns)$ at each of these $ns$ points. The measures
$\mu_n$ and $\mu^{\sharp}_{ns}$ are examples of what are called random probability measures; the
randomness arises from the fact that the measures depend on the observed values of
random vectors.

For the consistency results, we need to assume that the empirical measures $\mu_n$
converge weakly, almost surely, towards some probability measure $\mu$ in $\mathbb{R}^k$; in
symbols, $\mu_n \xrightarrow{w} \mu$ a.s. The definition of this mode of convergence for any sequence $\{\lambda_n\}$
of random probability measures in $\mathbb{R}^k$ is as follows,

$$\lambda_n \xrightarrow{w} \mu \ \text{a.s.} \quad \text{if and only if} \quad \Big( \forall f \in C_b : \int f \, d\lambda_n \to \int f \, d\mu \Big) \ \text{almost surely},$$

where $C_b$ is the set of bounded continuous functions $\mathbb{R}^k \to \mathbb{R}$. An argument due
to Varadarajan [2, Th 11.4.1] then implies that for measures in $\mathbb{R}^k$,

$$\lambda_n \xrightarrow{w} \mu \ \text{a.s.} \quad \text{if and only if} \quad \int f \, d\lambda_n \xrightarrow{\text{a.s.}} \int f \, d\mu, \quad \forall f \in C_b. \tag{2}$$

Unlike the condition of the actual definition, condition (2) is often easy to verify.
E.g., if $Z_1, Z_2, \ldots$ are i.i.d., then the strong law of large numbers implies that the
empirical measures $\mu_n \xrightarrow{w} \mu$ a.s., where $\mu$ denotes the common law of the $Z_i$'s on $\mathbb{R}^k$. The
same conclusion holds by the ergodic theorem, if the sequence $\{Z_i\}$ is stationary
and ergodic.

Suppose that the empirical measures $\mu_n \xrightarrow{w} \mu$ a.s., where $\mu$ is some (nonrandom)
probability measure in $\mathbb{R}^k$. Let $T \subset \mathbb{R}^m$ be our parameter set and let there be defined
a loss function $\ell$ on $\mathbb{R}^k \times \mathbb{R}^m$. The aim is to select the parameter $t \in T$ such that
the risk

$$r(t) := \int \ell(z, t) \, \mu(dz), \qquad t \in T \tag{3}$$

is minimal. Here the measure $\mu$ is supposed to be unknown, so we cannot solve
the problem directly. Instead we may try to minimize either the empirical risk
associated with the original observations

$$r_n(t) := \int \ell(z, t) \, \mu_n(dz) \tag{4}$$

or the empirical risk associated with the noisy observations

$$r^{\sharp}_{ns}(t) := \int \ell(z, t) \, \mu^{\sharp}_{ns}(dz). \tag{5}$$
3 Consistency

If $A \subset T$, define the distance of $t \in T$ from $A$ by $d(t, A) := \inf\{\|t - y\| : y \in A\}$, with
the convention that $d(t, \emptyset) = \infty$. We write $\operatorname{argmin}_T r$ for the (possibly empty) set
of points that minimize $r$ on $T$. We seek conditions guaranteeing for our estimators
$\theta_n$ that

$$d(\theta_n, \operatorname{argmin}_T r) \to 0 \quad \text{almost surely}.$$

If this holds, we say that $\theta_n$ is strongly consistent for the set $\operatorname{argmin}_T r$.

The following result is proved in [5].

Theorem 1 Let $s \ge 1$. If $h_n \to 0$ almost surely and, for some probability measure $\mu$, $\mu_n \xrightarrow{w} \mu$ a.s., then
also $\mu^{\sharp}_{ns} \xrightarrow{w} \mu$ a.s. as $n \to \infty$.


This motivates the following approach. Let $\{\lambda_n\}$ be a sequence of random
probability measures and $\mu$ a nonrandom probability measure such that $\lambda_n \xrightarrow{w} \mu$ a.s. and
let

$$R_n(t) := \int \ell(z, t) \, \lambda_n(dz). \tag{6}$$

Suppose that $\theta_n$ is a random vector with values in $T$ such that

$$R_n(\theta_n) = \inf_{t \in T} R_n(t) + o(1), \quad \text{(a.s.)}. \tag{7}$$

Under certain conditions, $\theta_n$ is then strongly consistent for the set $\operatorname{argmin}_T r$, as is
formulated in the following theorem. Hence we obtain a consistency theorem for any
estimator $\tau_n$ coming close enough to minimizing $r_n$. Provided $h_n \to 0$ almost surely,
we also obtain a consistency theorem for any estimator $\tau^{\sharp}_{ns}$ coming close enough
to minimizing $r^{\sharp}_{ns}$.


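Theorem 2 Let the parameter set $T$ be compact and let $\lambda_n \xrightarrow{w} \mu$ a.s. Suppose $\ell$ is
continuous on $\mathbb{R}^k \times T$ and dominated by a continuous, $\mu$-integrable function, i.e.,

$$|\ell(z, t)| \le L(z), \qquad z \in \mathbb{R}^k, \ t \in T,$$

where $L \ge 0$ is continuous and $\int L \, d\mu < \infty$. If, in addition, $\int L \, d\lambda_n \to \int L \, d\mu$ almost surely, then
any random vector satisfying (7) is strongly consistent for the set $\operatorname{argmin}_T r$.

In practice, the most useful dominating functions are the scalar multiples of the
powers $|z|^p$, $p \ge 1$. If $\int |z|^p \, \mu_n(dz) \to \int |z|^p \, \mu(dz) < \infty$ almost surely and $E|N|^p < \infty$, where $N$ has the law of the noise vectors, then it
can be shown that also $\int |z|^p \, \mu^{\sharp}_{ns}(dz) \to \int |z|^p \, \mu(dz)$ almost surely. This facilitates checking the
conditions of the previous theorem.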

4  Asymptotic  Distribution 

A consistency result does not tell how quickly the estimator converges. One way
to characterize this rate is by giving an asymptotic distribution for the estimator.
Our asymptotic distribution results are of the form $\sqrt{n}(\theta_n - t^*) \xrightarrow{d} N(0, C)$, i.e., they
state that $\sqrt{n}(\theta_n - t^*)$ converges in distribution to a normal law with mean zero
and a covariance matrix $C$. Here $t^*$ denotes a minimizer of $r$. Such a result says
that the law of $\theta_n$ collapses towards a point mass at $t^*$ at a very specific rate, e.g.,
in the sense that for any $\epsilon > 0$, $n^{1/2 - \epsilon} |\theta_n - t^*|$ converges to zero in probability and
$n^{1/2 + \epsilon} |\theta_n - t^*|$ converges to infinity in probability.

Henceforth we assume that the original observations $Z_1, Z_2, \ldots$ are i.i.d. and that
the noise vectors have mean zero. The effect of noisy training in linear regression is
relatively straightforward to analyze. This is the case where $z$ is the pair $(x, y)$, $x \in \mathbb{R}^m$,
$y \in \mathbb{R}$ and $\ell(z, t) = (x^T t - y)^2$. Denote the minimizer of $r_n$ by $\tau_n$ and the
minimizer of $r^{\sharp}_{ns}$ by $\tau^{\sharp}_{ns}$. If $h_n = o_p(n^{-1/4})$, i.e., if $n^{1/4} h_n$ converges in probability
to zero, then it turns out that $\tau^{\sharp}_{ns}$ and $\tau_n$ have the same asymptotic distribution. If
$h_n$ converges to zero at some slower rate, then the situation is more complicated;
when it can be obtained, the asymptotic distribution of $\tau^{\sharp}_{ns}$ then typically depends
on the sequence $\{h_n\}$.
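The linear regression case lends itself to a small simulation. The sketch below fits a scalar slope (no intercept) by least squares on an original sample and on the corresponding noisy sample, with a rapidly vanishing smoothing parameter. All numerical choices here (true slope 2, uniform data and noise, n = 2000, s = 5, smoothing parameter n^(-1/2)) are assumptions made only for illustration and are not taken from the paper.

```python
import random


def least_squares_slope(points):
    """Minimizer over scalar t of the empirical risk mean((x * t - y) ** 2)."""
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    return sxy / sxx


rng = random.Random(1)
n, s = 2000, 5
h_n = n ** -0.5                  # illustrative, rapidly vanishing smoothing parameter

# Original sample: y = 2 x + small bounded disturbance.
sample = []
for _ in range(n):
    x = rng.uniform(-1.0, 1.0)
    y = 2.0 * x + rng.uniform(-0.1, 0.1)
    sample.append((x, y))

# Noisy sample: s perturbed copies of each point, mean-zero bounded noise.
pseudo = [
    (x + h_n * rng.uniform(-1.0, 1.0), y + h_n * rng.uniform(-1.0, 1.0))
    for (x, y) in sample
    for _ in range(s)
]

tau_n = least_squares_slope(sample)      # estimator from the original sample
tau_ns = least_squares_slope(pseudo)     # estimator from the noisy sample
print(tau_n, tau_ns, abs(tau_n - tau_ns))
```

Noise added to the inputs acts like a small ridge penalty, shrinking the slope slightly towards zero, and the size of that shrinkage scales with the square of the smoothing parameter. Repeating the experiment with a larger smoothing parameter makes the gap between the two estimators visibly larger.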

The same kind of results hold also more generally. Let now $t^* \in T$ satisfy $\nabla r(t^*) = 0$,
i.e., $t^*$ is a stationary point of the risk $r$. We assume that $t^*$ is an isolated stationary
point in the interior of $T$. Further, we assume that $\ell$ is a C^-function on $\mathbb{R}^k \times \mathbb{R}^m$,
and that $Z_1$ has a compactly supported law. Then the matrices

$$A := E[\nabla_t^2 \ell(Z_1, t^*)], \qquad B := \operatorname{Cov}[\nabla_t \ell(Z_1, t^*)]$$

are well defined. Here $\nabla_t$ and $\nabla_t^2$ denote the gradient and Hessian operators with
respect to $t$, respectively. We assume that $A$ is invertible and that $B$ is nonzero.
Further, we assume that the noise vectors have a compactly supported law and that
the smoothing parameters satisfy $0 < h_n \le M$ for some constant $M$.

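Under these assumptions, let $h_n = o_p(n^{-1/4})$. Let $\{\tau_n\}$ be a sequence of $T$-valued
random vectors such that

$$\tau_n \xrightarrow{p} t^*, \quad \text{and} \quad \int \nabla_t \ell(z, \tau_n) \, \mu_n(dz) = o_p(n^{-1/2}),$$

and let $\{\tau^{\sharp}_{ns}\}$ be a sequence of $T$-valued random vectors such that

$$\tau^{\sharp}_{ns} \xrightarrow{p} t^*, \quad \text{and} \quad \int \nabla_t \ell(z, \tau^{\sharp}_{ns}) \, \mu^{\sharp}_{ns}(dz) = o_p(n^{-1/2}).$$

Here $\tau_n$ and $\tau^{\sharp}_{ns}$ render the respective empirical risks stationary in an asymptotic
sense. It is shown in [5] that then

$$\sqrt{n}(\tau_n - t^*) \xrightarrow{d} N(0, A^{-1} B A^{-1}) \quad \text{and} \quad \sqrt{n}(\tau^{\sharp}_{ns} - t^*) \xrightarrow{d} N(0, A^{-1} B A^{-1}). \tag{8}$$

That is, the asymptotic distributions of $\tau_n$ and $\tau^{\sharp}_{ns}$ coincide. The asymptotic
distribution for $\tau_n$ is naturally the same as in [8, Th. 2]; the result for $\tau^{\sharp}_{ns}$ is new.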

This asymptotic distribution can be used, e.g., to construct an asymptotic
confidence interval for $t^*$ or for the risk at $t^*$. Instead of this, we now use this result
to characterize the generalization performance of our estimators. Let the sequence
$\{\theta_n\}$ stand either for $\{\tau_n\}$ or for $\{\tau^{\sharp}_{ns}\}$. The generalization performance of $\theta_n$ can
be characterized by the law of $r(\theta_n)$, the risk evaluated at the estimator. It can be
shown that its asymptotic distribution is now given by

$$n[r(\theta_n) - r(t^*)] \xrightarrow{d} \tfrac{1}{2} U^T A U, \quad \text{where } U \sim N(0, A^{-1} B A^{-1}). \tag{9}$$

5  Conclusions 

We have outlined new results for the convergence properties of minimization
estimators in noisy training. The main conditions in the consistency result are that
the empirical measures associated with the original sample converge weakly, almost
surely, towards some measure $\mu$ and that the smoothing parameters $h_n \to 0$, almost
surely.

The main conditions for the asymptotic distribution result are the following: the
original sample points are i.i.d., the noise vectors have zero mean and $h_n = o_p(n^{-1/4})$.
Under certain additional conditions we then have that the asymptotic distributions
of $\tau_n$ and $\tau^{\sharp}_{ns}$ are identical, where $\tau_n$ denotes the minimizer of the empirical risk
associated with the original sample and $\tau^{\sharp}_{ns}$ the minimizer of the empirical risk
associated with the noisy sample. This implies that the asymptotic distributions of $r(\tau_n)$
and $r(\tau^{\sharp}_{ns})$ coincide and hence additive noise can have only a higher-order effect on
the performance of a minimization estimator as the sample size goes to infinity.
However, numerical evidence indicates that additive noise sometimes does improve
the generalization performance of a minimization estimator, at least with small
sample sizes. It remains to be seen whether this effect can be quantified by analyzing
the distribution of $r(\tau^{\sharp}_{ns})$ using more refined asymptotic methods.
The diffusing messenger Nitric Oxide plays an important role in the learning processes in the
brain. This diffusive learning mechanism adds a non-linear and non-local effect to the standard
Hebbian learning rule. A diffusive learning rule can lead to topographic map formation but also has
a strong tendency to homogenise the synaptic strengths. We derive a non-linear integral equation
that describes the fixed point and show which parameter regimes lead to non-trivial solutions.
Keywords: Nitric Oxide, self-organization, topographic maps

Introduction 

Most  learning  rules  used  in  neurobiological  modelling  are  based  on  Hebb’s  postulate 
that  a  synaptic  connection  is  strengthened  if  and  only  if  there  is  both  a  post-synaptic 
and  pre-synaptic  depolarization  at  that  particular  synapse.  Recent  physiological 
research,  however,  shows  that  this  assumption  is  not  always  warranted.  Experiments 
in  areas  varying  from  rat  hippocampus  to  cats’  visual  cortex  show  that  there  is  a 
non-local  effect  in  learning. 

This non-local learning is often thought to be mediated by retrograde messengers.
Nitric Oxide (NO) is a candidate for such a messenger: it is produced post-synaptically
(both in the dendrites and the soma [1]), then diffuses through the
intracellular space and is taken up pre-synaptically. The chemical properties of
NO allow it to diffuse over relatively long distances in the cortex; with a diffusion
constant of 3.3 × 10^-5 cm^2/s and a half-life in the intracellular fluid of 4 to 6 seconds,
it has an effective range of at least 150 µm. If NO were an ordinary neurotransmitter
it would be hard to understand how the specificity of neuronal connections in the
brain on a scale smaller than this could be achieved. An explanation has been found
in experimental setups with locally modifiable concentrations of NO: physiologists
have been able to demonstrate that low levels of NO lead to depression of synaptic
strengths (LTD) and high levels lead to Long Term Potentiation (LTP) (for recent
reviews see [3, 8]). This non-linearity in the dependence of the synaptic change on
the NO concentration is crucial to attain specificity in a network with a diffusing
messenger.
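
The quoted range can be checked with a rough estimate. The sketch below is illustrative only: it assumes free diffusion, takes the one-dimensional root-mean-square displacement sqrt(2Dt) after one half-life as the "effective range", and reads the diffusion constant as 3.3 × 10^-5 cm^2/s. The function name is hypothetical.

```python
import math


def rms_displacement_um(diffusion_cm2_per_s, time_s, dims=1):
    """Root-mean-square displacement sqrt(2 * dims * D * t), in micrometres."""
    length_cm = math.sqrt(2.0 * dims * diffusion_cm2_per_s * time_s)
    return length_cm * 1.0e4  # 1 cm = 10^4 micrometres


D_NO = 3.3e-5  # cm^2/s, assumed reading of the diffusion constant of NO

# Displacement after one half-life, for the two ends of the 4 to 6 s interval.
for half_life_s in (4.0, 6.0):
    print(half_life_s, rms_displacement_um(D_NO, half_life_s))
```

Under these assumptions both half-lives give a displacement above 150 µm, which is consistent with the range stated in the text.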

Detailed simulations of such a non-linear and non-local learning rule have been
performed in [2, 7]. In these simulations pattern formation in the weights was seen
to occur but, as the input patterns and the initial (partly topographic) weight
distributions were already organized, it is difficult to tell how general these results
are. In [5] we analysed networks with a diffusing messenger in a linearised
reaction-diffusion framework and found that homogeneous weights were the dominant
solutions of the dynamics. Following up a suggestion in [4] that Nitric Oxide
could underlie a mechanism similar to the neighbourhood function in the SOFM,
we showed in [6] that a diffusing messenger can indeed support the development of
topographic maps without the need for a Mexican Hat lateral interaction.

Here  we  extend  our  previous  analysis  to  arrive  at  a  fuller  account  of  the  non-linear 
effects.  We  derive  a  general  fixed  point  equation  for  the  weights  and  determine 
which  parameter  regimes  admit  non-zero  homogeneous  solutions. 
1 The Model

We consider a neural field consisting of leaky integrator neurons with activity $u$,
all sampling their input with a synaptic density function $r$ from a field of synapses
with strengths $s$. Note that this implies that synapses do not “belong” to a neuron;
a neuron just takes its input from the synapses within its reach (determined by
the function $r$). Inputs $a(x, x_0)$ are Gaussian shaped with the maximum at $x_0$ and
spread $\sigma$. The inputs are presented with equal probability at each position $x_0$. Nitric
Oxide is produced in the dendrites at a rate proportional to the local depolarization,
decays at unit rate (the $-n$ term below) and has diffusion constant $\kappa$. Learning is
non-linearly dependent on the local NO level through the function $f$, which captures the
qualitative effects described above (see also equation 2). Given an input centred at $x_0$,
the dynamical equations can be written as:

$$\dot{u}(x, x_0) = -u(x) + \int dx'\, r(x, x')\, s(x')\, a(x', x_0),$$

$$\dot{n}(x, x_0) = -n(x) + s(x)\, a(x, x_0) + \kappa \nabla^2 n(x, x_0),$$

$$\dot{s}(x) = -s(x) + f[n(x, x_0)]\, a(x, x_0).$$


We  assume  (as  in  [7])  that  all  learning  is  dependent  on  the  local  NO  level  and 
there  is  no  direct  dependence  on  the  post-synaptic  depolarization  (see  Discussion). 
The  local  NO  level  that  determines  the  change  in  synaptic  strength  depends  on  the 
details  of  the  post-synaptic  production  of  NO  and  the  pre-synaptic  measurement 
process.  As  the  details  of  this  process  are  as  yet  unknown,  we  use  a  Gaussian  Green’s 
function  (G(x,x'))  to  model  the  spatial  distribution  of  NO.  This  approximation  is 
exact  if  the  NO  is  produced  in  a  short  burst  and  measured  some  fixed  period  of  time 
later.  The  real  process  of  production  and  measurement  will  presumably  be  more 
complicated,  but  we  expect  that  at  least  the  qualitative  features  are  well  described 
by  a  Gaussian  kernel.  With  these  approximations  and  averaging  over  all  patterns, 
the  fixed  point  equation  for  the  learning  dynamics  can  be  written  in  the  form  of 
the following non-linear Fredholm integral equation:

$$s(x) = \int dx_0\, a(x, x_0)\, f\!\left[ \int dx'\, G(x, x')\, s(x')\, a(x', x_0) \right] \qquad (1)$$


In section 2 we substitute the simplest non-linear function that still captures the
qualitative features observed in experiments:

$$f(n) = -n\,(n - D)\,(n - P) \qquad (2)$$

with $D$ and $P$ two parameters that can be determined directly from experiments.
The unphysical values of $f$ for negative NO concentration merely stabilise the trivial
fixed point, and the decrease of $f$ after its positive maximum can either be
interpreted as toxicity of a high concentration of NO or an easy way of modelling
a saturation of the NO production. The important features of the function are the
negative values for small NO concentration (LTD) and positive values (LTP) for
high NO concentration.
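
The sign structure of equation 2 is easy to verify numerically. The following is a minimal sketch; the function name and the values chosen for D and P are hypothetical and serve only to illustrate the three regimes (LTD below D, LTP between D and P, decrease beyond P).

```python
def learning_nonlinearity(n, D, P):
    """Equation 2: f(n) = -n (n - D) (n - P)."""
    return -n * (n - D) * (n - P)


# Hypothetical thresholds with D < P (not taken from the paper).
D, P = 0.2, 1.0

low = learning_nonlinearity(0.1, D, P)    # 0 < n < D: negative, i.e. depression (LTD)
mid = learning_nonlinearity(0.5, D, P)    # D < n < P: positive, i.e. potentiation (LTP)
high = learning_nonlinearity(1.5, D, P)   # n > P: negative again ("toxicity" or saturation)
print(low, mid, high)
```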


2  Analysis 

In the appendix (section 4) we use the assumptions about the non-linearity, the spread of NO
($\rho$) and the width ($\sigma$) of the Gaussian inputs in equation 1 to derive the following
expression in Fourier space for the fixed point of the dynamics,

$$s(k) = -DP\,N_1\,s(k)\,e^{[\ldots]}$$
$$\qquad + (D + P)\,N_2 \int dk'\, s(k')\, s(k' - k)\, e^{[\ldots]} \qquad (3)$$
$$\qquad - N_3 \int\!\!\int dk'\, dk''\, s(k')\, s(k'')\, s(k' + k'' - k)\, e^{[\ldots]}$$

where $[\ldots]$ marks exponents that are illegible in the source. The parameters $N_n$ and
$\beta_n$ ($n = 1, 2, 3$) are functions of $\rho$ and $\sigma$; their defining expressions are also
illegible in the source.

To derive the conditions under which the dynamics admit homogeneous solutions
for the weights, we substitute the solution $s(k) = s_h \delta(k)$ in equation 3 and derive
the following third order polynomial in $s_h$:

$$s_h = -DP\,N_1\,s_h + (D + P)\,N_2\,s_h^2 - N_3\,s_h^3 \qquad (4)$$

Clearly, the zero-weight solution is a fixed point, and non-trivial homogeneous solutions
can be determined by finding the two remaining roots of this equation.
Non-trivial homogeneous solutions exist if equation 4 has at least one positive real root $s_h$.
This is the case if the width $\sigma$ of the stimuli satisfies the following constraint:

    [constraint relating $\sigma$, $\rho$, $D$ and $P$; illegible in the source]    (5)

and in the limiting case, the homogeneous solution is given by

    [expression for $s_h$; illegible in the source]

For widths larger than this limiting case there will be two non-zero fixed points, of which
only the larger is a stable solution. This is evident from figure 1, in which we solve
equation 4 pictorially.
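
The root structure of equation 4 can also be examined numerically. Dividing equation 4 by $s_h$ leaves the quadratic $N_3 s_h^2 - (D+P) N_2 s_h + (1 + DP N_1) = 0$, which has real roots only when $(D+P)^2 N_2^2 \ge 4 N_3 (1 + DP N_1)$. The sketch below solves this quadratic and classifies each root by the sign of the slope of (right-hand side minus $s_h$), which tests stability against homogeneous perturbations only. The function names and all numerical values of $D$, $P$, $N_1$, $N_2$, $N_3$ are hypothetical, since the expressions for $N_n$ are not available here.

```python
import math


def homogeneous_fixed_points(D, P, N1, N2, N3):
    """Non-zero roots of equation 4, i.e. of N3 s^2 - (D+P) N2 s + (1 + D P N1) = 0."""
    a = N3
    b = -(D + P) * N2
    c = 1.0 + D * P * N1
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        return []  # only the zero-weight solution exists
    root = math.sqrt(disc)
    return sorted([(-b - root) / (2.0 * a), (-b + root) / (2.0 * a)])


def is_stable(s, D, P, N1, N2, N3):
    """Stable against homogeneous perturbations if d/ds [RHS(s) - s] < 0."""
    slope = -D * P * N1 + 2.0 * (D + P) * N2 * s - 3.0 * N3 * s * s - 1.0
    return slope < 0.0


# Hypothetical parameter values, chosen only so that two non-zero roots exist.
params = dict(D=0.2, P=1.0, N1=1.0, N2=3.0, N3=1.0)
roots = homogeneous_fixed_points(**params)
for s in [0.0] + roots:
    print(s, "stable" if is_stable(s, **params) else "unstable")
```

With these illustrative values the zero solution and the larger non-zero root are stable and the smaller non-zero root is unstable, matching the statement in the text.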

Even though the couplings between the different Fourier components of the weights
in equation 3 are limited, a general periodic solution is not easily found. From
substituting $s(k) = \delta(k) - \delta(k - k^*)$ in the equation it can be seen that in this
case there is a coupling with only one higher harmonic, $s(k) = \delta(k - 2k^*)$. This
suggests that a (numerical) iterative procedure to solve the integral equation could
be fruitful. We will attempt such a solution in future work.
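
One possible form of such an iterative procedure is sketched below. It is not the authors' method: it simply discretises equation 1 on a periodic one-dimensional grid and applies a damped update, which is an Euler step of the pattern-averaged weight dynamics. The unnormalised Gaussian kernels, the treatment of the widths as standard deviations, the ring geometry, and all names and parameter values are assumptions made for illustration.

```python
import math


def gaussian_kernel(m, dx, width):
    """m x m matrix of exp(-d^2 / (2 width^2)), with d the distance on a ring."""
    def dist(i, j):
        d = abs(i - j)
        return min(d, m - d) * dx
    return [[math.exp(-dist(i, j) ** 2 / (2.0 * width ** 2)) for j in range(m)]
            for i in range(m)]


def f(n, D, P):
    """Equation 2."""
    return -n * (n - D) * (n - P)


def rhs(s, a, g, dx, D, P):
    """Right-hand side of equation 1, with integrals replaced by grid sums."""
    m = len(s)
    out = []
    for x in range(m):
        total = 0.0
        for x0 in range(m):
            # NO level at x for a stimulus centred at x0
            n = dx * sum(g[x][xp] * s[xp] * a[xp][x0] for xp in range(m))
            total += a[x][x0] * f(n, D, P)
        out.append(dx * total)
    return out


def relax(s, a, g, dx, D, P, eta=0.1, steps=50):
    """Damped fixed-point iteration s <- (1 - eta) s + eta * rhs(s)."""
    for _ in range(steps):
        t = rhs(s, a, g, dx, D, P)
        s = [(1.0 - eta) * si + eta * ti for si, ti in zip(s, t)]
    return s


m, dx = 16, 0.5                              # hypothetical grid
a = gaussian_kernel(m, dx, width=1.0)        # inputs, assumed width
g = gaussian_kernel(m, dx, width=1.0)        # NO Green's function, assumed spread
D, P = 0.2, 1.0                              # hypothetical thresholds
flat = rhs([0.3] * m, a, g, dx, D, P)        # a homogeneous profile stays homogeneous
zero = relax([0.0] * m, a, g, dx, D, P, steps=3)  # the trivial fixed point is preserved
```

Two sanity checks follow from the structure of equation 1 and hold for this discretisation: zero weights remain zero, and a homogeneous weight profile is mapped to a homogeneous profile by translation invariance on the ring.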


3  Discussion 

We have shown that the weight dynamics of a neural field with a diffusing messenger
can be described by a non-linear Fredholm integral equation and that non-zero
homogeneous solutions for the weights are stable for stimuli wider than a critical
width. We expect that in real tissue some mechanism would be needed to prevent
this tendency to homogenise the weights: either the production of NO or its diffusion
through the tissue has to be controlled. We expect such a mechanism to be found
when the chemical process by which NO is synthesized is more fully understood.
The current model assumes that all learning is NO dependent and ignores any
influence of the activity of the post-synaptic neurons. In reality, the NO effects we
model and the more standard Hebbian learning rules could operate concurrently. In
that case the interaction between the two learning rules needs to be investigated.
Furthermore, we assumed that all NO is produced in the dendrites. There is evidence,
however, that the production can take place in the soma and even the axons
[1]. Our previous analyses [5, 6] considered production of NO in the soma, and
showed behaviour different from that described here. A full model would include
all the sources of NO and the interaction between them.

Figure 1 Solving for the homogeneous solutions of equation 3. RHS indicates
the Right Hand Side of this equation and the solid line is the Left Hand Side. The
closed bullets show stable fixed points whereas the open bullets are unstable fixed
points. The critical size of the inputs is given by equation 5.

Lastly,  we  have  modelled  all  time-dependent  effects  of  the  diffusion  with  the  single 
parameter  p.  Interesting  dynamics  could  result  from  including  the  time  dependence 
explicitly,  especially  if  the  inputs  are  temporally  correlated.  We  will  address  this 
issue  in  future  work. 

4  Appendix 

Starting from equation 1 we substitute the Gaussian Green’s function with spread
$\rho$ and the inputs to derive the fixed point equation. First we derive the NO concentration
at position $x$ when a stimulus is presented at $x_0$. The weights are written
as an integral of Fourier components $s(k)$:

$$n(x, x_0) = \int dk'\, s(k') \int dx'\, G(x, x')\, a(x', x_0)\, e^{ik'x'}$$

This Fourier integral over a shifted Gaussian can be evaluated explicitly:

$$n(x, x_0) = [\ldots] \int dk'\, s(k')\, e^{[\ldots]}$$

where $[\ldots]$ marks a prefactor and an exponent that are illegible in the source.
For the averaged weights’ fixed point equation we have to average over all the
patterns:

$$s(x) = \int dx_0\, a(x, x_0)\, f[n(x, x_0)].$$

For each of the terms in $f$, the integral over the distribution of patterns has to be
done separately. Here we do the linear case; the quadratic and cubic terms follow
by analogy but will include interactions between the different Fourier components.
Substituting the non-linearity $f$ from equation 2, the integral over the pattern
positions $x_0$ in the linear term is another Fourier integral and gives:

    [expression illegible in the source; only the factor $-DP$ survives]

In Fourier space this integral equation can be written as:

$$s^{(1)}(k) = -DP\,N_1\,s^{(1)}(k)\,e^{[\ldots]}.$$

The quadratic and cubic terms can be determined analogously, which results in
equation 3.

Krekelberg & Taylor: Modelling a Diffusing Messenger
The purpose of this paper is to show how fundamental results from variational approximation
theory can be exploited to design recurrent associative memory networks. We begin by stating the
problem of learning a set of given patterns with an associative memory network as a hypersurface
construction problem, then we show that the associated approximation problem is well-posed.
Characterizing and determining such a solution will lead us to introduce the desired associative
memory network which can be viewed as a recurrent radial basis function (RBF) network which
has as many attractor states as there are fundamental patterns (no spurious memories).
1 Introduction

Learning from a set of input-output patterns, and sometimes from additional a priori
knowledge, in neural networks usually amounts to determining a set of parameters
relative to a given network. In the case of feedforward networks, the problem
may be equivalent to determining a continuous mapping. Such a problem can be
stated in the framework of multivariate approximation theory which, in some cases,
is closely linked to regularization [11]. Concerning learning in recurrent associative
memory networks, one usually has to determine a set of parameters connected with
the dynamics of a given network so that the patterns (memories) to be stored are
stable states of such a network [1, 3, 7]. Complete characterization of the network
dynamics may be achieved by defining a global Lyapunov function (energy) which
decreases during the network evolution (the recalling process) and whose local minima
are the network’s stable states [3, 6, 7].

The  procedure  followed  here  for  the  design  of  an  associative  memory  network  is  the 
reverse  of  the  general  approach:  using  methods  from  variational  approximation, 
we  determine  a  function  (or  a  hypersurface)  which  has  as  many  local  maxima  as 
the  number  of  patterns  to  be  memorized,  and  then  define  a  dynamics  on  such  a 
hypersurface  (actually,  the  gradient  dynamics)  which  has  its  stable  states  close  to 
the  patterns  to  be  memorized. 

We begin by proving that the problem is well posed from the approximation point
of view, and that the desired function can be explicitly computed when choosing
convenient constraints and parameters in the variational formulation of the problem.
We finally show that the derived gradient dynamics can be implemented by an
associative memory network. Such a network can be viewed as a recurrent radial
basis function network which has as many stable states as there are fundamental
patterns to be stored. We also discuss how basins of attraction can be shaped so
that no spurious stable states can be encountered during the recalling process.

2  Statement  of  the  Problem 

Let us consider a set of patterns, $A = \{S_1, S_2, \ldots, S_m\} \subset \Omega \subset \mathbb{R}^n$. The first
stage of the proposed approach is to construct a hypersurface defined by a bounded
smooth function, $F : \mathbb{R}^n \to \mathbb{R}$, whose unique local maxima are close to the patterns
$S_i$, $i = 1, \ldots, m$. Therefore, we will reduce the problem to a functional minimization
under constraints which aim to translate the following assumptions: i) $F$ should
interpolate the data $(S_i, y_i)$, $i = 1, \ldots, m$, where $y_i = F(S_i)$ are large positive real
values, and $F$ should tend to zero outside of the data. This constraint is reduced
here to the minimization of the $L_2$ norm of $F$. ii) $F$ should not oscillate between
the data. This constraint is reduced to surface tension minimization, which usually
amounts to minimizing the $L_2$ norm of the gradient of $F$. iii) $F$ should be smooth
enough. This constraint is usually imposed by minimizing the $L_2$ norm of smoothing
differential operators ($D^k$) to be introduced in the following.

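Let us define $C_y = \{ f : \mathbb{R}^n \to \mathbb{R},\ f \in C^r(\mathbb{R}^n),\ r \ge 0,\ \text{and } f(S_i) = y_i \}$. Determining
an approximation of $F$ regarding the given three constraints can be reduced to the
minimization of a cost functional $J$ over $C_y$,

$$J(f) = \lambda_0 \|D^0 f\|^2 + \lambda_1 \|D^1 f\|^2 + \ldots + \lambda_p \|D^p f\|^2, \quad f \in C_y \qquad (1)$$

The operators $D^k$ are defined by [definition illegible in the source].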

To  solve  the  minimization  problem,  we  can  either  use  standard  methods  from  the 
calculus  of  variations  by  means  of  the  Euler-Lagrange  equation  [4,  11],  or  use  direct 
methods  based  on  functional  analysis.  Herein,  we  use  the  latter  method  which  is 
closely  related  to  the  reproducing  kernels  method  [5,  15]. 

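First, let us consider the Hilbert space $H^p(\mathbb{R}^n) = \{ f \in L_2(\mathbb{R}^n);\ \sum_{k=0}^{p} \|D^k f\|^2 < \infty \}$
endowed with the inner product $\langle f, g \rangle_{H^p} = \sum_{k=0}^{p} \langle f, g \rangle_k$, where
$\langle f, g \rangle_k = \langle D^k f, D^k g \rangle$. When $p$ and $n$ satisfy the condition $p > \frac{n}{2} + r$, $r \ge 0$, then
$H^p(\mathbb{R}^n)$ is a subset of $C^r(\mathbb{R}^n)$ [12]. We assume this condition is satisfied in the
following.

To show that the problem is well-posed (i.e. it has a unique solution in $C_y$), we
consider the following vector space, $H^p_\lambda(\mathbb{R}^n) = \{ f \in L_2(\mathbb{R}^n),\ \sum_{k=0}^{p} \lambda_k \|D^k f\|^2 < \infty,\ \text{and } \lambda_k > 0 \}$.
It can easily be shown [8, 12] that $H^p_\lambda(\mathbb{R}^n)$ endowed with the
inner product $\langle f, g \rangle_{H^p_\lambda} = \sum_{k=0}^{p} \lambda_k \langle f, g \rangle_k$ is a Hilbert space whose norm is
$\|f\|_{H^p_\lambda} = (\langle f, f \rangle_{H^p_\lambda})^{1/2}$. Moreover, $H^p$ and $H^p_\lambda$ are topologically equivalent since
their norms are equivalent.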

Now, observe that the squared norm of $H^p_\lambda(\mathbb{R}^n)$ is identical to the functional $J$ to be
minimized, so the minimization problem can be rewritten as,

$(\mathcal{P})$  Minimize $J(f) = \|f\|^2_{H^p_\lambda}$, $f \in C_y$

$(\mathcal{P})$ is then reduced to a special class of approximation problems in normed linear
spaces [14, 9]. We will use convenient methods of such theory to prove the existence
and give a characterization of a solution to the problem.

Theorem 1 Given strictly positive real parameters $\lambda_k$, $k = 1, \ldots, p$, the problem
$(\mathcal{P})$ has a unique solution $F \in H^p_\lambda(\mathbb{R}^n)$, which is characterized by

$$\langle F, u \rangle_{H^p_\lambda} = \sum_{i=1}^{m} \beta_i \langle \delta_i, u \rangle, \quad \forall u \in H^p_\lambda,$$

where $\beta_i$ are real parameters and $\langle \delta_i, \cdot \rangle$ are Dirac operators supported by $S_i$, $i = 1, \ldots, m$.

Proof The proof¹ is based on the projection theorem for the existence and the
characterization of the best approximation in a linear manifold of a Hilbert space
[8, 9, 14]. In order to use such results, we first show that $C_y$ is a linear manifold of
$H^p_\lambda$. This is achieved by considering the set $C_0$ defined as
$C_0 = \{ v \in H^p_\lambda,\ \langle \delta_i, v \rangle = v(S_i) = 0,\ i = 1, \ldots, m \}$, which is a vector subspace
of $H^p_\lambda$, and observing that $C_y$ is a translation of $C_0$. Therefore, the projection
theorem asserts that there is a unique function $F \in C_y$ which solves $(\mathcal{P})$, and
since $C_y$ is a linear manifold associated to $C_0$, $F$ can be characterized by the relation
$\langle F, u \rangle_{H^p_\lambda} = \sum_{i=1}^{m} \beta_i \langle \delta_i, u \rangle$, $\forall u \in H^p_\lambda$, where $\beta_i$ are real parameters. □

¹ The details of the proof can be found in [8].

The previous theorem provides only a characterization of the solution, but we are
interested in determining explicitly such a solution. Therefore, we consider the
differential operator $P$ defined as a combination of iterated Laplacians,

$$Pf = \lambda_0 f - \lambda_1 \Delta f + \ldots + (-1)^p \lambda_p \Delta^p f, \quad \forall f \in H^p_\lambda,$$

which leads us to the following theorem.

Theorem 2 Consider the functions (kernels) $G(\cdot, S_1), G(\cdot, S_2), \ldots, G(\cdot, S_m)$ as respective
solutions of the partial differential equations (in the distributions sense),
$P\,G(X, S_i) = \delta(X - S_i)$, $i = 1, \ldots, m$, $\forall X \in \mathbb{R}^n$. Then, there exist unique parameters
$\beta_1, \beta_2, \ldots, \beta_m$ such that the solution $F$ of the problem $(\mathcal{P})$ is

$$F(X) = \sum_{i=1}^{m} \beta_i\, G(X, S_i),$$

where $(\beta_1, \ldots, \beta_m)$ is the solution of the linear system
$F(S_j) = \sum_{i=1}^{m} \beta_i\, G(S_j, S_i) = y_j$, $j = 1, \ldots, m$.

Proof First, let us show that any function $f(X) = \sum_{i=1}^{m} \beta_i\, G(X, S_i)$, with arbitrary
$\beta_i$’s, satisfies the characterization of theorem (1). If we consider any pair of functions
$\phi$ and $\psi$ in $H^p_\lambda$, an integration by parts in the sense of distributions gives [13],
$\langle \phi, \psi \rangle_k = (-1)^k \langle \Delta^k \phi, \psi \rangle$. This relation allows us to write, for any $u \in H^p_\lambda$,

$$\Big\langle \sum_{i=1}^{m} \beta_i\, G(\cdot, S_i),\, u \Big\rangle_{H^p_\lambda} = \sum_{i=1}^{m} \sum_{k=0}^{p} \beta_i\, \lambda_k\, (-1)^k \langle \Delta^k G(\cdot, S_i), u \rangle = \sum_{i=1}^{m} \beta_i \langle \delta_i, u \rangle.$$

So the function $F$ given in the theorem satisfies the characterization stated in
theorem (1). Finally, to show that $F$ is the solution of the problem $(\mathcal{P})$, we have to
show that $F \in C_y$, which amounts to showing that there is a unique solution (in the $\beta_i$’s)
to the following linear system, $F(S_i) = \sum_{j=1}^{m} \beta_j\, G(S_i, S_j) = y_i$, $i = 1, \ldots, m$. Since
$P$ is a linear combination of iterated Laplacians, it is rotation and translation
invariant [5, 13], and the kernels $G(\cdot, S_i)$ are radial, centered respectively on $S_i$,
and can be written as $G(X, S_i) = G(\|X - S_i\|)$.

To show that the linear system has a unique solution, it suffices to show that the
function $G(t)$ is positive definite [10]. Since $G$ verifies the equations

$$\lambda_0 G(\|X - S_i\|) - \lambda_1 \Delta G(\|X - S_i\|) + \ldots + (-1)^p \lambda_p \Delta^p G(\|X - S_i\|) = \delta(\|X - S_i\|),$$

it can be shown that $G(t)$ can be written as [11] $G(t) = \int \exp(it\omega)\, dV(\omega)$, where $V$ is a
bounded nondecreasing function. $G$ is then written
in the form required by the Bochner theorem [2] which characterizes positive
definite functions. Therefore, $F(X) = \sum_{i=1}^{m} \beta_i\, G(X, S_i)$ is the solution of $(\mathcal{P})$. □
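
The practical content of Theorem 2 is that the coefficients follow from one linear solve with a symmetric positive definite Gram matrix. The sketch below illustrates this in one dimension with the kernel $\exp(-a|t|)$ (the form of $G_1$ in figure 1). The patterns, target values, the parameter `a` and all function names are hypothetical; a plain Cholesky factorisation is used as the solver because it succeeds exactly when the matrix is positive definite.

```python
import math


def kernel(t, a=1.0):
    """Radial kernel exp(-a |t|); 'a' is an assumed width parameter."""
    return math.exp(-a * abs(t))


def cholesky(A):
    """Lower-triangular L with A = L L^T; raises if A is not positive definite."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if acc <= 0.0:
                    raise ValueError("matrix is not positive definite")
                L[i][i] = math.sqrt(acc)
            else:
                L[i][j] = acc / L[j][j]
    return L


def solve_spd(A, y):
    """Solve A beta = y by forward and backward substitution on the Cholesky factor."""
    L = cholesky(A)
    n = len(y)
    z = [0.0] * n
    for i in range(n):
        z[i] = (y[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    beta = [0.0] * n
    for i in reversed(range(n)):
        beta[i] = (z[i] - sum(L[k][i] * beta[k] for k in range(i + 1, n))) / L[i][i]
    return beta


patterns = [-2.0, 0.5, 3.0]   # hypothetical one-dimensional patterns S_i
targets = [5.0, 5.0, 5.0]     # hypothetical large positive values y_i
gram = [[kernel(si - sj) for sj in patterns] for si in patterns]
beta = solve_spd(gram, targets)


def F(x):
    """Interpolant F(x) = sum_i beta_i G(x, S_i)."""
    return sum(b * kernel(x - s) for b, s in zip(beta, patterns))
```

Evaluating `F` at each pattern reproduces the corresponding target up to rounding error, which is the interpolation constraint $F(S_i) = y_i$.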

Figure 1 The solution $F$: (left) for $\lambda_0 = \sigma^2 > 0$, $\lambda_1 = 1$, and $\lambda_k = 0$, $k \ge 2$, we
have $G_1(t) = \exp(-\sigma |t|)$ which is not differentiable at the origin; (right) for
$\lambda_0 = \sigma^4$, $\lambda_1 = 2\sigma^2$, $\lambda_2 = 1$, and $\lambda_k = 0$, $k \ge 3$, we have $G_2(t) = G_1(t) * G_1(t)$ (the
convolution product) which is differentiable.

Labbi: A Variational Approach to Associative Memory

Poggio & Girosi [11] considered a similar variational problem for learning continuous
mappings under a priori assumptions. They derived similar kernels $G(t)$ using
the Euler-Lagrange equation. Setting the $\lambda_k$ to some particular values gives some
illustrations of the solution $F$ in figure 1. If we let $p$ tend to infinity in the problem
$(\mathcal{P})$ as in [11] (considering only functions with infinite smoothness degree), and
choose $\lambda_k$ = [value illegible in the source], then following the same reasoning as in the
previous sections, we obtain $\tilde{G}(\omega) = \exp(-[\ldots])$ (exponent illegible in the source), and
hence $F$ is a linear combination of Gaussians
centered on the $S_i$’s with positive coefficients $\beta_i$. Indeed, in the linear system $Y = G \cdot B$,
where $(B)_i = \beta_i$, $(G)_{ij} = \exp(-\|S_i - S_j\|^2 / 2\sigma^2)$, and $(Y)_i \gg 0$, if we take a small
$\sigma$ and decompose the matrix $G$ as $G = I + Q$, where $I$ is the identity matrix and
$(Q)_{ij} = \exp(-\|S_i - S_j\|^2 / 2\sigma^2)$ for $i \ne j$, we can make the approximation $G^{-1} \approx I - Q$.
Therefore, we get $B = G^{-1} Y \approx (I - Q) Y$. Since the elements of $Q$ can be reduced
as desired by choosing a small $\sigma$, we get positive $\beta_i$, and so the computed function
preserves the a priori constraints imposed on the desired function (see figure 1).
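
The first-order approximation $B \approx (I - Q)Y$ can be checked directly: since $G(I - Q) = I - Q^2$, the interpolation residual is of second order in the off-diagonal entries. The sketch below assumes a Gaussian kernel of the form $\exp(-\|S_i - S_j\|^2 / 2\sigma^2)$; the patterns, the value of $\sigma$, the targets and the function names are hypothetical.

```python
import math


def gaussian_gram(patterns, sigma):
    """Gram matrix with entries exp(-||S_i - S_j||^2 / (2 sigma^2)) (assumed form)."""
    def k(u, v):
        d2 = sum((a - b) ** 2 for a, b in zip(u, v))
        return math.exp(-d2 / (2.0 * sigma ** 2))
    return [[k(u, v) for v in patterns] for u in patterns]


def approx_coefficients(gram, y):
    """First-order estimate beta ~ (I - Q) y, where Q is the off-diagonal part of G."""
    m = len(y)
    return [y[i] - sum(gram[i][j] * y[j] for j in range(m) if j != i)
            for i in range(m)]


def residual(gram, beta, y):
    """Interpolation error G beta - y for a given coefficient vector."""
    m = len(y)
    return [sum(gram[i][j] * beta[j] for j in range(m)) - y[i] for i in range(m)]


patterns = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]   # hypothetical, well separated
y = [10.0, 10.0, 10.0]                            # hypothetical large targets
gram = gaussian_gram(patterns, sigma=1.0)
beta = approx_coefficients(gram, y)
err = residual(gram, beta, y)
```

For well separated patterns the estimated coefficients stay positive and the residual is small compared with the targets; both degrade as $\sigma$ grows and the Gaussians start to overlap.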

3  Building  an  Associative  Memory  Network 

Let us consider the solution $F(X) = \sum_{j=1}^{m} \beta_j \exp(-\|X - S_j\|^2 / 2\sigma^2)$ obtained by setting
the $\lambda_k$ as above. The second stage of our approach is to build an associative memory
network whose unique attractor states are close to the patterns $S_i$. If we consider
the gradient dynamics applied to $F$, then from any initial state $X_0$, we stabilize
on a local maximum of $F$ which is close to a pattern $S_i$. The closeness of the
actual maxima of $F$ to the patterns $S_i$ depends on the size of $\sigma$. For a large $\sigma$,
the deviation is more important because of the overlap between Gaussians, while
it is less important for a small $\sigma$. This is a natural consequence, since large penalty
parameters $\lambda_k$ would result from a small $\sigma$, and hence the constraints would be
better preserved. A discretization of the gradient dynamics applied to $F$ gives (the
step-size constant is illegible in the source and is written here as $\alpha$),

$$x_i(t + 1) = x_i(t) + \alpha \sum_{j=1}^{m} \beta_j\, (s_j^i - x_i(t))\, \exp(-\|X(t) - S_j\|^2 / 2\sigma^2), \quad i = 1, \ldots, n$$

This dynamics can be implemented by a two-layer recurrent RBF network (see
figure 2). The first layer consists of $n$ units encoding the input/output pattern, and
the second (hidden) layer consists of $m$ units computing radial functions centered
on the patterns $S_j$, $G_j(X) = \beta_j \exp(-\|X - S_j\|^2 / 2\sigma^2)$. The connection weight between an
input/output unit $i$ and a hidden unit $j$ is $s_j^i$ (the $i$-th component of $S_j$). The network
has been used as a part of a system for handwritten character recognition and
reconstruction [8].
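
The recall process can be sketched as plain gradient ascent on a sum of Gaussians. This is an illustration under stated assumptions, not the network implementation used in [8]: the Gaussian form $\exp(-\|X - S_j\|^2 / 2\sigma^2)$, the unit coefficients, the step size, the iteration count and all names are hypothetical.

```python
import math


def gradient(x, patterns, beta, sigma):
    """Gradient of F(X) = sum_j beta_j exp(-||X - S_j||^2 / (2 sigma^2))."""
    grad = [0.0] * len(x)
    for s, b in zip(patterns, beta):
        d2 = sum((xi - si) ** 2 for xi, si in zip(x, s))
        weight = b * math.exp(-d2 / (2.0 * sigma ** 2)) / sigma ** 2
        for i in range(len(x)):
            grad[i] += weight * (s[i] - x[i])
    return grad


def recall(x0, patterns, beta, sigma, step=0.05, iterations=500):
    """Discrete gradient ascent: X(t+1) = X(t) + step * grad F(X(t))."""
    x = list(x0)
    for _ in range(iterations):
        g = gradient(x, patterns, beta, sigma)
        x = [xi + step * gi for xi, gi in zip(x, g)]
    return x


patterns = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]   # hypothetical stored patterns
beta = [1.0, 1.0, 1.0]                            # hypothetical positive coefficients
recalled = recall((2.4, 0.5), patterns, beta, sigma=1.0)
```

Starting from a corrupted version of the second pattern, the state settles near (3, 0). The small remaining offset comes from the overlap with the neighbouring Gaussians and shrinks as $\sigma$ is reduced.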
Figure 2 The associative memory network architecture.
In this paper we present a novel method for transforming nonseparable nonlinear programming
(NLP) problems into separable ones using multilayer neural networks. This method is based on
a useful feature of multilayer neural networks, i.e., any nonseparable function can be approximately
expressed as a separable one by a multilayer neural network. By use of this method, the
nonseparable objective and (or) constraint functions in NLP problems can be approximated by
multilayer neural networks, and therefore, any nonseparable NLP problem can be transformed
into a separable one. The importance of this method lies in the fact that it provides us with a
promising approach to using modified simplex methods to solve general NLP problems.
Keywords: separable nonlinear programming, linear programming, multilayer neural network.

1  Introduction 

Consider the following NLP problem:

    Minimize    $p(x)$  for $x \in \mathbb{R}^n$                                   (1)
    subject to  $g_i(x) \ge 0$  for $i = 1, 2, \ldots, m$,
                $h_j(x) = 0$  for $j = 1, 2, \ldots, r$,

where $p(x)$ is called the objective function, $g_i(x)$ is called an inequality constraint
and $h_j(x)$ is called an equality constraint.

NLP problems are widespread in the mathematical modeling of engineering design
problems such as VLSI chip design, mechanical design, and chemical design. Unfortunately,
for general NLP problems, computer programs are not available for
problems of very large size. For a class of NLP problems known as separable [4, 1],
some variation of the simplex method, a well-developed and efficient method for
solving linear programming (LP) problems, can be used as a solution procedure.
A separable nonlinear programming (SNLP) problem is an NLP problem where
the objective function and the constraint functions can be expressed as the sum of
functions of a single variable. An SNLP problem can be expressed as follows:

    Minimize    $\sum_{k=1}^{n} p_k(x_k)$                                        (2)
    subject to  $\sum_{k=1}^{n} g_{ik}(x_k) \ge 0$  for $i = 1, 2, \ldots, m$,
                $\sum_{k=1}^{n} h_{jk}(x_k) = 0$  for $j = 1, 2, \ldots, r$.

An important problem in mathematical programming is to generalize the simplex
method to solve NLP problems. In this paper we discuss how to use multilayer
neural networks to transform nonseparable NLP problems into SNLP problems.


2  Transformation  of  Nonseparable  Functions 

Let $q(x)$ be a multivariable function. Given a set of training data sampled over
$q(x)$, we can train a three-layer network¹ to approximate $q(x)$ [2]. After training,
the mapping $\hat{q}(x)$ formed by the network can be regarded as an approximation of
$q(x)$ and expressed as follows:

$$\hat{q}(x) = \alpha + \beta \left( f\!\left( \sum_{j=1}^{N_2} w_{31j}\, f\!\left( \sum_{i=1}^{N_1} w_{2ji}\, x_i + bias_{2j} \right) + bias_{31} \right) - \gamma \right), \qquad (3)$$

where $x_i$ is the input of the $i$th unit in the input layer, $w_{kji}$ is the weight connecting
the $i$th unit in the layer $(k - 1)$ to the $j$th unit in the layer $k$, $bias_{kj}$ is the bias of
the $j$th unit in the layer $k$, $f$ is the sigmoidal activation function, $N_k$ is the number
of units in the layer $k$, and $\alpha$, $\beta$, and $\gamma$ are three constants which are determined by the
formula used for normalizing training data.

Introducing auxiliary variables $b_{31}$, $b_{21}$, $b_{22}$, $\ldots$, and $b_{2N_2}$ into Eq. (3), we can
obtain the following simultaneous equations

    $\hat{q}(b_{31}) = \alpha + \beta\,(f(b_{31}) - \gamma)$
    $b_{31} - \sum_{j=1}^{N_2} w_{31j}\, f(b_{2j}) = bias_{31}$
    $b_{21} - \sum_{i=1}^{N_1} w_{21i}\, x_i = bias_{21}$                               (4)
    $\vdots$
    $b_{2N_2} - \sum_{i=1}^{N_1} w_{2N_2 i}\, x_i = bias_{2N_2}$,

where $b_{kj}$ for $k = 2, 3$, $j = 1, 2, \ldots, N_k$, and $x_i$ for $i = 1, 2, \ldots, N_1$, are variables.
We see that all of the functions in Eq. (4) are separable. The importance of Eq. (4)
lies in the fact that it provides us with an approach to approximately expressing
nonseparable functions as separable ones by multilayer neural networks.
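
The equivalence between the nested form (3) and the separable system (4) can be checked numerically. In the sketch below the weights, biases, normalisation constants and function names are all hypothetical (no trained network is involved); the point is only that, once the auxiliary variables take the values of the pre-activations, every equality in Eq. (4) holds and the separable objective reproduces the network output.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def network_output(x, w2, bias2, w3, bias3, alpha, beta, gamma):
    """Eq. (3): nested, nonseparable form of the three-layer network."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w2, bias2)]
    return alpha + beta * (sigmoid(sum(w * h for w, h in zip(w3, hidden)) + bias3) - gamma)


def separable_residuals(x, b2, b3, w2, bias2, w3, bias3):
    """Left side minus right side of each equality in Eq. (4)."""
    res = [b2[j] - sum(w * xi for w, xi in zip(w2[j], x)) - bias2[j]
           for j in range(len(b2))]
    res.append(b3 - sum(w * sigmoid(b) for w, b in zip(w3, b2)) - bias3)
    return res


def separable_objective(b3, alpha, beta, gamma):
    """First line of Eq. (4): a function of the single variable b3."""
    return alpha + beta * (sigmoid(b3) - gamma)


# Hypothetical network with 2 inputs and 3 hidden units.
x = [0.3, -1.2]
w2 = [[0.5, -0.4], [1.0, 0.2], [-0.7, 0.9]]
bias2 = [0.1, -0.3, 0.2]
w3 = [0.8, -0.5, 0.3]
bias3 = 0.05
alpha, beta, gamma = 0.0, 2.0, 0.5

# Auxiliary variables set to the pre-activations of the hidden and output units.
b2 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(w2, bias2)]
b3 = sum(w * sigmoid(b) for w, b in zip(w3, b2)) + bias3
residuals = separable_residuals(x, b2, b3, w2, bias2, w3, bias3)
```

Each residual is a sum of functions of one variable at a time (linear in the $x_i$ and $b_{kj}$, sigmoidal in each $b_{2j}$), which is what makes the transformed problem separable.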

In comparison with conventional function approximation problems the training task
mentioned above is easier to deal with. The reasons for this are that (a) an
arbitrarily large number of sample data for training and testing can be obtained from
$q(x)$, and (b) the goal of training is to approximate a given function $q(x)$, so the
performance of the trained network can be easily checked.

3  Transformation  of  Nonseparable  NLP  problems 

According to the locations of nonseparable functions in NLP problems, nonseparable
NLP problems can be classified into three types: (I) only the objective function
is a nonseparable function; (II) only the constraint functions are nonseparable
functions; and (III) both the objective and constraint functions are nonseparable
functions.

¹ For simplicity of description, we consider only three-layer networks throughout the paper. The
results can be extended to M-layer (M > 3) networks easily.

For Type I nonseparable NLP problems, we only need to transform the objective
function into a separable one. Replacing the objective function with its approximation
in Eq. (4), we can transform a Type I nonseparable NLP problem into an SNLP
problem as follows:

    Minimize    $\alpha + \beta\,(f(b^O_{31}) - \gamma)$                                      (5)
    subject to  $b^O_{31} - \sum_{j=1}^{N_2} w^O_{31j}\, f(b^O_{2j}) = bias^O_{31}$
                $b^O_{21} - \sum_{i=1}^{N_1} w^O_{21i}\, x_i = bias^O_{21}$
                $\vdots$
                $b^O_{2N_2} - \sum_{i=1}^{N_1} w^O_{2N_2 i}\, x_i = bias^O_{2N_2}$
                $\sum_{k=1}^{n} g_{ik}(x_k) \ge 0$  for $i = 1, \ldots, m$,
                $\sum_{k=1}^{n} h_{jk}(x_k) = 0$  for $j = 1, \ldots, r$,

where the “O” superscript refers to quantities on the objective function.

Using an approach similar to the one described above, we can transform Type II and
III nonseparable NLP problems into SNLP problems, and deal with unconstrained
nonlinear programming problems.

Suppose that the accuracy of approximating nonseparable functions is good enough
for a given problem. The transformed SNLP problems are then equivalent to their original
nonseparable NLP problems, since the transformations only change the expression
forms of the objective function and (or) the constraint functions. The accuracy
of approximation may be affected by many factors such as the network size, the
learning algorithms and the number of training data. There are several methods
for dealing with this problem; for example, see [3, 8, 7]. We can use these results to
guide our training and obtain good accuracy.

4  Analysis  of  Complexity 

Now let us analyze the complexity of the original and transformed problems. Suppose
that each nonseparable function is approximated by a network with the same number
of hidden units (N). Also suppose that there are s (s ≤ m) nonseparable inequality
and t (t ≤ r) nonseparable equality constraint functions in Type II and III problems.
The numbers of variables and constraints in the original and transformed problems are
shown in Table 1.

Problem    No. of variables                    No. of constraints
Original   n                                   m + r
I          n + N + 1                           m + r + N + 1
II         n + sN + tN + (s + t)               (m - s) + (r - t) + sN + tN + s + t
III        n + N + sN + tN + (1 + s + t)       (m - s) + (r - t) + N + sN + tN + (1 + s + t)

Table 1 The numbers of variables and constraints in the original and transformed
problems.

From Table 1 we see that as the number of hidden units grows, both the number
of variables and the number of constraints in the transformed problems increase.
Therefore, it is desirable that each nonseparable function be approximated by a
network with as few hidden units as possible. On the other hand, it may become
more difficult for a smaller network (i.e., one with fewer hidden units) to
approximate a nonseparable function. There exists a trade-off between the
complexity of the transformed SNLP problems and the approximating capability
of neural networks.

5  Simulation  Results 

Consider  the  following  simple  NLP  problem: 

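$$
\begin{aligned}
\text{Minimize}\quad & 2 - \sin^{2} x_1 \,\sin^{2} x_2 \qquad\qquad (6)\\
\text{subject to}\quad & 0.5 \le x_1 \le 2.5,\\
& 0.5 \le x_2 \le 2.5.
\end{aligned}
$$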

Clearly, this is a Type I nonseparable NLP problem. A three-layer perceptron with
two input units, ten hidden units and one output unit is used to approximate the
objective function. The training data set consists of 524 input-output pairs, which
are gathered by sampling the input space [0.5, 2.5] x [0.5, 2.5] on a uniform grid.
The network is trained by the back-propagation algorithm [6]. In this simulation,
the learning is considered complete when the sum of squared errors between the target
and actual outputs falls below 0.05. Replacing the objective function with its
approximation formed by the network, we obtain a SNLP problem.

Approximating the sigmoidal activation function in the SNLP problem over the
interval [-16, 16] via 14 grid points, we obtain an approximate SNLP problem.
Solving this problem by the simplex method with the restricted basis entry rule
[4], we obtain the solution $x_1^* = 1.531488$ and $x_2^* = 1.595567$. If the sigmoidal
activation function is approximated over the interval [-16, 16] via 40 grid points,
we obtain a better solution: $x_1^* = 1.564015$ and $x_2^* = 1.569683$. Solving this
problem directly with the Powell method [5], we obtain a more accurate solution:
$x_1^* = 1.57079$ and $x_2^* = 1.57079$. It should be noted that there exists a trade-off
between the accuracy of the approximation of SNLP problems and the number of
grid points for each variable.
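The effect of the number of grid points can be illustrated by a minimal sketch that measures how well a piecewise-linear interpolant reproduces the sigmoid on [-16, 16]. It assumes a uniformly spaced grid, which the text does not state, and the function name `max_pwl_error` is hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def max_pwl_error(num_grid_points, lo=-16.0, hi=16.0):
    """Largest absolute gap between the sigmoid and its piecewise-linear
    interpolant on a uniform grid (the uniform spacing is an assumption)."""
    grid = np.linspace(lo, hi, num_grid_points)
    dense = np.linspace(lo, hi, 20001)               # fine mesh for measuring the gap
    approx = np.interp(dense, grid, sigmoid(grid))   # linear interpolation between grid points
    return float(np.max(np.abs(approx - sigmoid(dense))))

err_14 = max_pwl_error(14)
err_40 = max_pwl_error(40)
```

The finer grid gives the smaller interpolation error, at the cost of more variables in the approximate SNLP problem, which is the trade-off noted above.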

In general, several local solutions may exist in a SNLP problem, but only one
solution can be obtained by solving the approximate SNLP problem using the simplex
method with the restricted basis entry rule. It has been shown that if the objective
function is strictly convex and all the constraint functions are convex, the solution
obtained by the modified simplex method can be made sufficiently close to the global
optimal solution of the original problem by choosing a small grid. Unfortunately, the
SNLP problems transformed by our method are non-convex, since the sigmoidal
activation function is non-convex. In such a case, even though optimality of the
solution cannot be claimed with the restricted basis entry rule, good solutions can
be obtained [1].

6  Conclusion  and  Future  Work 

We have demonstrated how multilayer neural networks can be used to transform
nonseparable functions into separable ones. Applying this useful feature to nonlinear
programming, we have proposed a novel method for transforming nonseparable NLP
problems into separable ones. This result opens up a way of solving general NLP
problems by some variation of the simplex method, and establishes a connection
between multilayer neural networks and mathematical programming techniques. As
future work, we will perform simulations on large-scale nonseparable NLP problems.

The purpose of this paper is to present a probabilistic theory of self-organising networks based
on the results published in [1]. This approach allows vector quantisers and topographic mappings
to be treated as different limiting cases of the same theoretical framework. The full theoretical
machinery allows a visual cortex-like network to be built.

1  Introduction 

The  purpose  of  this  paper  is  to  present  a  generalisation  of  the  probabilistic  approach 
to  the  static  analysis  of  self-organising  neural  networks  that  appeared  in  [1].  In  the 
simplest  case  the  network  has  two  layers:  an  input  and  an  output  layer.  An  input 
vector  is  used  to  clamp  the  pattern  of  activity  of  the  nodes  in  the  input  layer, 
and  the  resulting  pattern  of  individual  “firing”  events  of  the  nodes  in  the  output 
layer  is  described  probabilistically.  Finally,  an  attempt  is  made  to  reconstruct  the 
pattern  of  activity  in  the  input  layer  from  knowledge  of  the  location  of  the  firing 
events  in  the  output  layer.  This  inversion  from  output  to  input  is  achieved  by 
using  Bayes’  theorem  to  invert  the  probabilistic  feed-forward  mapping  from  input 
to  output.  A  network  objective  function  is  then  introduced  in  order  to  optimise 
the  overall  network  performance.  If  the  average  Euclidean  error  between  an  input 
vector  and  its  corresponding  reconstruction  is  used  as  the  objective  function,  then 
many  standard  self-organising  networks  emerge  as  special  cases  [1,  2]. 

In  section  2  the  network  objective  function  is  introduced,  in  section  3  a  simpler 
form  is  derived  which  is  an  upper  bound  to  the  true  objective  function,  and  in 
section  4  the  derivatives  with  respect  to  various  parameters  of  this  upper  bound 
are  derived.  Finally,  in  section  5  various  standard  neural  networks  are  analysed 
within  this  framework. 

2  Objective  Function 

The basic mathematical object is the objective function $D$, which is defined as

$$
D \equiv \int d\mathbf{x}\,\Pr(\mathbf{x}) \sum_{y_1, y_2, \cdots, y_n = 1}^{m} \Pr(y_1, y_2, \cdots, y_n|\mathbf{x}) \int d\mathbf{x}'\,\Pr(\mathbf{x}'|y_1, y_2, \cdots, y_n)\,\|\mathbf{x} - \mathbf{x}'\|^2 \qquad (1)
$$

where $\mathbf{x}$ is the input vector and $\mathbf{x}'$ is its reconstruction, $(y_1, y_2, \cdots, y_n)$ are the
locations in the output layer of $n$ firing events, and $\|\mathbf{x} - \mathbf{x}'\|^2$ is the squared Euclidean
distance between the input vector and its reconstruction. The various probabilities
arise as follows: $\int d\mathbf{x}\,\Pr(\mathbf{x})(\cdots)$ integrates over the training set of input vectors,
$\Pr(y_1, y_2, \cdots, y_n|\mathbf{x})$ is the joint probability of $n$ firing events at locations
$(y_1, y_2, \cdots, y_n)$, $\Pr(\mathbf{x}'|y_1, y_2, \cdots, y_n)$ is the Bayes' inverse probability that input
vector $\mathbf{x}'$ is inferred as the cause of the $n$ firing events, and $\sum_{y_1, y_2, \cdots, y_n = 1}^{m}(\cdots)$ sums
over all possible locations (on an assumed rectangular lattice of size $m$) of the $n$
firing events. The order in which the $n$ firing events occur will be assumed not to be
observed, so that $\Pr(y_1, y_2, \cdots, y_n|\mathbf{x})$ is a symmetric function of $(y_1, y_2, \cdots, y_n)$.

A simplifying assumption will be made where the observed firing events are assumed
to be statistically independent, so that

$$
\Pr(y_1, y_2, \cdots, y_n|\mathbf{x}) = \Pr(y_1|\mathbf{x})\,\Pr(y_2|\mathbf{x}) \cdots \Pr(y_n|\mathbf{x}) \qquad (2)
$$

Normally, a simple form for $\Pr(y|\mathbf{x})$ is used such as

$$
\Pr(y|\mathbf{x}) = \frac{Q(\mathbf{x}|y)}{\sum_{y'=1}^{m} Q(\mathbf{x}|y')}
$$

where $Q(\mathbf{x}|y) > 0$, which guarantees that the normalisation condition
$\sum_{y=1}^{m}\Pr(y|\mathbf{x}) = 1$ holds. However, a more general expression will be used here
based on the definition

$$
\Pr(y|\mathbf{x}; y') \equiv \frac{Q(\mathbf{x}|y)}{\sum_{y''\in\mathcal{N}(y')} Q(\mathbf{x}|y'')} \qquad (3)
$$

which implies that $\sum_{y\in\mathcal{N}(y')}\Pr(y|\mathbf{x}; y') = 1$, where $\mathcal{N}(y')$ is the set of node
locations that lie in a predefined “neighbourhood” of $y'$. This neighbourhood can be
used to introduce “lateral inhibition” between the firing neurons if the expression
for $\Pr(y|\mathbf{x})$ is written as a sum over $\Pr(y|\mathbf{x}; y')$ as follows

$$
\Pr(y|\mathbf{x}) = \frac{1}{M}\sum_{y'\in\mathcal{N}^{-1}(y)} \Pr(y|\mathbf{x}; y') = \frac{1}{M}\,Q(\mathbf{x}|y) \sum_{y'\in\mathcal{N}^{-1}(y)} \frac{1}{\sum_{y''\in\mathcal{N}(y')} Q(\mathbf{x}|y'')} \qquad (4)
$$

where $\mathcal{N}^{-1}(y)$ is the “inverse neighbourhood” of $y$ defined as
$\mathcal{N}^{-1}(y) \equiv \{y' : y \in \mathcal{N}(y')\}$, and $M$ is the total number of locations in the output
layer that have a non-zero neighbourhood size ($M \equiv \sum_{y : \mathcal{N}(y) \neq \emptyset} 1$). This expression
for $\Pr(y|\mathbf{x})$ is identical to one that arises in the context of optimising a special class
of mixture distributions [3], and it satisfies $\sum_{y=1}^{m}\Pr(y|\mathbf{x}) = 1$. The expression for
$\Pr(y|\mathbf{x})$ in (4) may readily be generalised to the case where the neighbourhood is
non-uniformly weighted. The expression for $D$ in (1) and the expression for $\Pr(y|\mathbf{x})$
in (4) are the two basic ingredients in the theory of self-organising neural networks
presented in this paper.

There is one further ingredient that proves to be very useful. $\Pr(y|\mathbf{x})$ will be allowed
to “leak” as follows [1, 3]

$$
\Pr(y|\mathbf{x}) \longrightarrow \sum_{y'\in\mathcal{L}^{-1}(y)} \Pr(y|y')\,\Pr(y'|\mathbf{x}) \qquad (5)
$$

where $\Pr(y|y')$ is the amount of probability that leaks from location $y'$ to location
$y$, and $\mathcal{L}^{-1}(y)$ is the “inverse leakage neighbourhood” of $y$ defined as
$\mathcal{L}^{-1}(y) \equiv \{y' : y \in \mathcal{L}(y')\}$, where $\mathcal{L}(y')$ is the “leakage neighbourhood” of $y'$. The purpose of
leakage is to allow the network output to be “damaged” in a controlled way, so that
when the network is optimised it automatically becomes robust with respect to such
damage. For instance, if each node is allowed to leak probability onto its neighbours
in the output layer, then when the network is optimised the node properties become
topographically ordered.
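The neighbourhood-based expression (4) and the leakage step (5) can be checked numerically. The following minimal sketch assumes a one-dimensional lattice of m = 8 nodes, neighbourhoods of radius 2, random values for Q(x|y) and a Gaussian-shaped leakage over the whole lattice; all of these choices, and the array names, are illustrative assumptions rather than part of the theory.

```python
import numpy as np

m = 8
rng = np.random.default_rng(1)
Q = rng.uniform(0.1, 1.0, size=m)     # hypothetical values of Q(x|y) for one fixed input x

# in_nbhd[yp, y] is True when node y lies in the neighbourhood N(yp) (radius 2 here).
idx = np.arange(m)
in_nbhd = np.abs(idx[:, None] - idx[None, :]) <= 2

# Definition (3): P[yp, y] = Pr(y | x; yp), normalised over the neighbourhood of yp.
P = np.where(in_nbhd, Q[None, :], 0.0)
P /= P.sum(axis=1, keepdims=True)

# Expression (4): sum over the inverse neighbourhood of y, divided by M.
M = int(np.count_nonzero(in_nbhd.any(axis=1)))   # nodes with a non-empty neighbourhood
pr_y_given_x = P.sum(axis=0) / M

# Leakage (5): leak[yp, y] = Pr(y | yp), with every row summing to one.
leak = np.exp(-0.5 * (idx[:, None] - idx[None, :]) ** 2)
leak /= leak.sum(axis=1, keepdims=True)
pr_leaky = leak.T @ pr_y_given_x
```

Both `pr_y_given_x` and `pr_leaky` sum to one, as required of a probability over output locations; the leakage only redistributes probability between nodes.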

3  Simplify  the  Objective  Function 

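The network objective function can be simplified to yield [1]

$$
D = 2\int d\mathbf{x}\,\Pr(\mathbf{x}) \sum_{y_1, y_2, \cdots, y_n = 1}^{m} \Pr(y_1, y_2, \cdots, y_n|\mathbf{x})\,\|\mathbf{x} - \mathbf{x}'(y_1, y_2, \cdots, y_n)\|^2 \qquad (6)
$$

where $\mathbf{x}'(y_1, y_2, \cdots, y_n)$ is a “reference vector” defined as $\mathbf{x}'(y_1, y_2, \cdots, y_n) \equiv
\int d\mathbf{x}\,\Pr(\mathbf{x}|y_1, y_2, \cdots, y_n)\,\mathbf{x}$. If $\mathbf{x}'(y_1, y_2, \cdots, y_n)$ is treated as an independent
variable (i.e. its definition is ignored) then minimisation of $D$ with respect to it yields
the result $\mathbf{x}'(y_1, y_2, \cdots, y_n) = \int d\mathbf{x}\,\Pr(\mathbf{x}|y_1, y_2, \cdots, y_n)\,\mathbf{x}$, which is consistent with
how it should have been defined anyway. This convenient trick allows the inverse
probability $\Pr(\mathbf{x}|y_1, y_2, \cdots, y_n)$ to be eliminated henceforth, provided that $D$ in
(6) is tacitly assumed to be minimised with respect to $\mathbf{x}'(y_1, y_2, \cdots, y_n)$.
$D$ can be further simplified to the form $D = D_1 + D_2 - D_3$ [2], where

$$
\begin{aligned}
D_1 &\equiv \frac{2}{n}\int d\mathbf{x}\,\Pr(\mathbf{x}) \sum_{y=1}^{m} \Pr(y|\mathbf{x})\,\|\mathbf{x} - \mathbf{x}'(y)\|^2\\
D_2 &\equiv \frac{2(n-1)}{n}\int d\mathbf{x}\,\Pr(\mathbf{x}) \sum_{y_1, y_2 = 1}^{m} \Pr(y_1, y_2|\mathbf{x})\,(\mathbf{x} - \mathbf{x}'(y_1))\cdot(\mathbf{x} - \mathbf{x}'(y_2))\\
D_3 &\equiv 2\sum_{y_1, y_2, \cdots, y_n = 1}^{m} \Pr(y_1, y_2, \cdots, y_n)\,\Bigl\|\mathbf{x}'(y_1, y_2, \cdots, y_n) - \frac{1}{n}\sum_{i=1}^{n}\mathbf{x}'(y_i)\Bigr\|^2
\end{aligned}
\qquad (7)
$$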

To obtain this result $\Pr(y_1, y_2, \cdots, y_n|\mathbf{x})$ has been assumed to be symmetric under
permutation of the locations $y_1, y_2, \cdots, y_n$ (i.e. the locations, but not the order of
occurrence, of the $n$ firing events are known). If the independence assumption in (2)
is now invoked, then $D_2$ may be simplified to

$$
D_2 = \frac{2(n-1)}{n}\int d\mathbf{x}\,\Pr(\mathbf{x})\,\Bigl\|\mathbf{x} - \sum_{y=1}^{m}\Pr(y|\mathbf{x})\,\mathbf{x}'(y)\Bigr\|^2 \qquad (8)
$$
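The step from the double sum in $D_2$ to the squared norm in (8) relies only on the independence assumption (2) and on the normalisation of $\Pr(y|\mathbf{x})$. A minimal numerical check for a single input vector is sketched below; the random reference vectors and probabilities are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(2)
m, dim = 5, 3
x = rng.normal(size=dim)              # one input vector
ref = rng.normal(size=(m, dim))       # hypothetical reference vectors x'(y)
p = rng.uniform(size=m)
p /= p.sum()                          # Pr(y|x), normalised over the m nodes

d = x - ref                           # row y holds x - x'(y)

# Integrand of D2 under independence: sum over y1, y2 of Pr(y1|x) Pr(y2|x) d_y1 . d_y2
double_sum = sum(p[a] * p[b] * (d[a] @ d[b]) for a in range(m) for b in range(m))

# Integrand of (8): squared distance between x and the weighted mix of reference vectors
collapsed = float(np.sum((x - p @ ref) ** 2))
```

The two numbers agree to rounding error, because the double sum factorises into the squared norm of $\sum_y \Pr(y|\mathbf{x})(\mathbf{x} - \mathbf{x}'(y))$.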

The dependence of $D_3$ on the $n$-argument reference vectors $\mathbf{x}'(y_1, y_2, \cdots, y_n)$ is
inconvenient, because the total number of such reference vectors is $O(|m|^n)$, where
$|m|$ is the total number of output nodes ($|m| = m_1 m_2 \cdots m_d$ for a $d$-dimensional
rectangular lattice of size $m$). However, the positivity of $D_3$, together with $D =
D_1 + D_2 - D_3$, will be used to obtain an upper bound to $D$ as $D \le D_1 + D_2$,
which depends only on 1-argument reference vectors $\mathbf{x}'(y)$. The total number of
1-argument reference vectors is equal to $|m|$.

4  Differentiate  the  Objective  Function 

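In order to implement an optimisation algorithm the derivatives of $D_1$ and $D_2$
with respect to the various parameters must be obtained. The expressions that are
encountered when differentiating are rather cumbersome, but they have a simple
structure which can be made clear by the introduction of the following notation

$$
\begin{aligned}
L_{y,y'} &\equiv \Pr(y'|y) & P_{y,y'} &\equiv \Pr(y'|\mathbf{x}; y)\\
p_y &\equiv \sum_{y'\in\mathcal{N}^{-1}(y)} P_{y',y} & (L^{T}p)_y &\equiv \sum_{y'\in\mathcal{L}^{-1}(y)} L_{y',y}\,p_{y'}\\
\mathbf{d}_y &\equiv \mathbf{x} - \mathbf{x}'(y) & e_y &\equiv \|\mathbf{x} - \mathbf{x}'(y)\|^2\\
(L\mathbf{d})_y &\equiv \sum_{y'\in\mathcal{L}(y)} L_{y,y'}\,\mathbf{d}_{y'} & (Le)_y &\equiv \sum_{y'\in\mathcal{L}(y)} L_{y,y'}\,e_{y'}\\
(PL\mathbf{d})_y &\equiv \sum_{y'\in\mathcal{N}(y)} P_{y,y'}\,(L\mathbf{d})_{y'} & (PLe)_y &\equiv \sum_{y'\in\mathcal{N}(y)} P_{y,y'}\,(Le)_{y'}\\
(P^{T}PL\mathbf{d})_y &\equiv \sum_{y'\in\mathcal{N}^{-1}(y)} P_{y',y}\,(PL\mathbf{d})_{y'} & (P^{T}PLe)_y &\equiv \sum_{y'\in\mathcal{N}^{-1}(y)} P_{y',y}\,(PLe)_{y'}\\
\bar{\mathbf{d}} &\equiv \sum_{y=1}^{m} (L^{T}p)_y\,\mathbf{d}_y & \text{or}\quad \bar{\mathbf{d}} &\equiv \sum_{y=1}^{m} (PL\mathbf{d})_y
\end{aligned}
\qquad (9)
$$

which allows (5) to be written as $\Pr(y|\mathbf{x}) \longrightarrow \frac{1}{M}(L^{T}p)_y$. $D_1$ and $D_2$ may be
differentiated with respect to $\mathbf{x}'(y)$ to obtain

$$
\begin{aligned}
\frac{\partial D_1}{\partial \mathbf{x}'(y)} &= -\frac{4}{nM}\int d\mathbf{x}\,\Pr(\mathbf{x})\,(L^{T}p)_y\,\mathbf{d}_y\\
\frac{\partial D_2}{\partial \mathbf{x}'(y)} &= -\frac{4(n-1)}{nM^2}\int d\mathbf{x}\,\Pr(\mathbf{x})\,(L^{T}p)_y\,\bar{\mathbf{d}}
\end{aligned}
\qquad (10)
$$

$D_1$ and $D_2$ may be functionally varied with respect to $\log Q(\mathbf{x}|y)$ to obtain

$$
\begin{aligned}
\delta D_1 &= \frac{2}{nM}\int d\mathbf{x}\,\Pr(\mathbf{x}) \sum_{y=1}^{m} \delta\log Q(\mathbf{x}|y)\,\bigl(p_y\,(Le)_y - (P^{T}PLe)_y\bigr)\\
\delta D_2 &= \frac{4(n-1)}{nM^2}\int d\mathbf{x}\,\Pr(\mathbf{x}) \sum_{y=1}^{m} \delta\log Q(\mathbf{x}|y)\,\bigl(p_y\,(L\mathbf{d})_y - (P^{T}PL\mathbf{d})_y\bigr)\cdot\bar{\mathbf{d}}
\end{aligned}
\qquad (11)
$$

If $Q(\mathbf{x}|y)$ is assumed to be a sigmoid function

$$
Q(\mathbf{x}|y) = \frac{1}{1 + \exp(-\mathbf{w}(y)\cdot\mathbf{x} - b(y))}
$$

then the derivatives of $D_1$ and $D_2$ with respect to the “weight” vector $\mathbf{w}(y)$ and
“bias” $b(y)$ may be written as

$$
\begin{aligned}
\frac{\partial D_1}{\partial\begin{pmatrix}\mathbf{w}(y)\\ b(y)\end{pmatrix}} &= \frac{2}{nM}\int d\mathbf{x}\,\Pr(\mathbf{x})\,\bigl(p_y\,(Le)_y - (P^{T}PLe)_y\bigr)\,(1 - Q(\mathbf{x}|y))\begin{pmatrix}\mathbf{x}\\ 1\end{pmatrix}\\
\frac{\partial D_2}{\partial\begin{pmatrix}\mathbf{w}(y)\\ b(y)\end{pmatrix}} &= \frac{4(n-1)}{nM^2}\int d\mathbf{x}\,\Pr(\mathbf{x})\,\bigl(p_y\,(L\mathbf{d})_y - (P^{T}PL\mathbf{d})_y\bigr)\cdot\bar{\mathbf{d}}\;(1 - Q(\mathbf{x}|y))\begin{pmatrix}\mathbf{x}\\ 1\end{pmatrix}
\end{aligned}
\qquad (12)
$$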

The gradients in (10) and (12) may be used to implement a gradient descent
algorithm for optimising $D_1 + D_2$, which then leads to a least upper bound on the full
objective function $D$ ($= D_1 + D_2 - D_3$).

5  Special  Cases 

Various  standard  results  that  are  special  cases  of  the  model  presented  above  are 
discussed  in  the  following  subsections. 

5.1  Vector  Quantiser  and  Topographic  Mapping 

Assume $n = 1$, so that only one firing event is observed and $D_2 = D_3 = 0$;
$y \in \mathcal{N}(y')\ \forall\, y, y'$, so that the neighbourhood embraces all of the output nodes;
and that probability leakage of the type given in (5) is allowed. Then $D$ reduces to
$D = 2\int d\mathbf{x}\,\Pr(\mathbf{x})\sum_{y=1}^{m}\Pr(y|y(\mathbf{x}))\,\|\mathbf{x} - \mathbf{x}'(y)\|^2$ [1], which leads to a behaviour
that is very similar to the topographic mappings described in [5], where $\Pr(y|y')$
now corresponds to the topographic neighbourhood function. In the limit where
$\Pr(y|y(\mathbf{x})) = \delta_{y,y(\mathbf{x})}$ this reduces to the criterion for optimising a vector quantiser
[4].
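A gradient step on this reduced objective can be sketched as follows. The sketch replaces the integral over $\Pr(\mathbf{x})$ by an average over a random training set, uses a Gaussian-shaped leakage matrix on a one-dimensional lattice, and takes the winning node $y(\mathbf{x})$ to be the one that minimises the leaked distortion; these choices and all names are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(size=(200, 2))            # hypothetical training set
m = 6
ref = rng.normal(size=(m, 2))               # reference vectors x'(y)
idx = np.arange(m)
leak = np.exp(-0.5 * (idx[:, None] - idx[None, :]) ** 2)
leak /= leak.sum(axis=1, keepdims=True)     # leak[yp, y] = Pr(y | yp)

def winners(ref):
    # sq[t, y] = ||x_t - x'(y)||^2 ; cost[t, yp] = sum_y Pr(y|yp) sq[t, y]
    sq = ((data[:, None, :] - ref[None, :, :]) ** 2).sum(axis=2)
    cost = sq @ leak.T
    return cost.argmin(axis=1), sq          # assumed winner: smallest leaked distortion

def objective(ref):
    win, sq = winners(ref)
    return 2.0 * float(np.mean((leak[win] * sq).sum(axis=1)))

def gradient_step(ref, rate=0.1):
    win, _ = winners(ref)
    weights = leak[win]                     # weights[t, y] = Pr(y | y(x_t))
    diff = data[:, None, :] - ref[None, :, :]
    grad = -4.0 * np.mean(weights[:, :, None] * diff, axis=0)
    return ref - rate * grad

d_before = objective(ref)
ref_next = gradient_step(ref)
d_after = objective(ref_next)
```

With the identity matrix in place of `leak`, the same update moves each reference vector towards the mean of the inputs it wins, which is the vector quantiser limit mentioned above.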

5.2  Visual  Cortex  Network 

A “visual cortex”-like network can be built if the full theoretical machinery
presented earlier is used. This network has many of the emergent properties of the
mammalian visual cortex, such as orientation maps, centre-on/surround-off detectors,
dominance stripes, etc. (see e.g. [7] for a review of these phenomena). There is
an input layer with a pattern of activity representing the input vector, an output
layer with nodes firing in response to feed-forward connections (i.e. a “recognition”
model), and a mechanism for reconstructing the input from the firing events
via feed-back connections (i.e. a “generative” model). The output layer has lateral
inhibition implemented as in (4); the neighbourhood of node $y'$ has an associated
inhibition factor $1/\sum_{y''\in\mathcal{N}(y')} Q(\mathbf{x}|y'')$, and the overall inhibition factor for node $y$
is $\sum_{y'\in\mathcal{N}^{-1}(y)}\bigl(1/\sum_{y''\in\mathcal{N}(y')} Q(\mathbf{x}|y'')\bigr)$, which is the sum of the inhibition factors
over all nodes $y'$ that have node $y$ in their neighbourhood. This scheme for
introducing lateral inhibition is discussed in greater detail in [3]. The leakage introduced
by $\Pr(y|y')$ induces topographical ordering as usual.

In the limit $n \to 1$, where $D_1$ is dominant, this network behaves like a topographic
mapping network, except that the output layer splits up into a number of “domains”,
each of which is typically a lateral inhibition length in size, and each of
which forms a separate topographic mapping. These domains are seamlessly joined
together, so no domain boundaries are actually visible. In the limit $n \to \infty$, where
$D_2$ is dominant, this network approximates its input as a superposition of reference
vectors $\sum_{y=1}^{m}\Pr(y|\mathbf{x})\,\mathbf{x}'(y)$ (see (8)). Thus the network is capable of explaining
the input in terms of multiple causes.

6  Conclusions 

A  single  theoretical  framework  has  been  shown  to  describe  a  number  of  standard 
self-organising  neural  networks.  This  makes  it  easy  to  understand  the  relationship 
between  these  neural  networks,  and  it  provides  a  useful  framework  for  analysing 
their properties.

In this contribution a new method for supervised training is presented. This method is based on
a recently proposed root finding procedure for the numerical solution of systems of non-linear
algebraic and/or transcendental equations in $\mathbb{R}^n$. This new method reduces the dimensionality of
the problem in such a way that it can lead to an iterative approximate formula for the computation
of $n - 1$ connection weights. The remaining connection weight is evaluated separately using the
final approximations of the others. This reduced iterative formula generates a sequence of points in
$\mathbb{R}^{n-1}$ which converges quadratically to the proper $n - 1$ connection weights. Moreover, it requires
neither a good initial guess for one connection weight nor accurate error function evaluations. The
new method is applied to some test cases in order to evaluate its performance.

Subject  classification:  AMS(MOS)  65K10,  49D10,  68T05,  68G05. 

Keywords:  Numerical  optimization  methods,  feed  forward  neural  networks,  supervised  training, 
back-propagation  of  error,  dimension-reducing  method. 

1  Introduction 

Consider a feed forward neural network (FNN) with $L$ layers, $l \in [1, L]$. The error
is defined as $e_k(t) = d_k(t) - y_k^{L}(t)$, for $k = 1, 2, \ldots, K$, where $d_k(t)$ is the desired
response at the $k$th neuron of the output layer at the input pattern $t$, and $y_k^{L}(t)$ is the
output at the $k$th neuron of the output layer $L$. If there is a fixed, finite set of input-
output cases, the square error over the training set, which contains $T$ representative
cases, is:

$$
E = \sum_{t=1}^{T} E(t) = \sum_{t=1}^{T}\sum_{k=1}^{K} e_k^{2}(t). \qquad (1)
$$

The most common supervised training algorithm for FNNs with sigmoidal
non-linear neurons is Back-Propagation (BP) [4]. BP minimizes the error
function $E$ using Steepest Descent (SD) with a fixed step size and computes the
gradient using the chain rule on the layers of the network. BP converges too slowly and
often yields suboptimal solutions. The quasi-Newton method (BFGS) [2] converges
much faster than BP, but the storage and computational requirements of the
Hessian for very large FNNs make its use impractical for most current machines.
In this paper, we derive and apply a new training method for FNNs named the
Dimension Reducing Training Method (DRTM). DRTM is based on the methods studied
in [3] and it incorporates the advantages of Newton and SOR algorithms (see [4]).

2  Description  of  the  DRTM 

Throughout this paper $\mathbb{R}^n$ is the $n$-dimensional real space of column weight vectors
$w$ with components $w_1, w_2, \ldots, w_n$; $(y; z)$ represents the column vector with
components $y_1, y_2, \ldots, y_m, z_1, z_2, \ldots, z_k$; $\partial_i E(w)$ denotes the partial derivative of $E(w)$
with respect to the $i$th variable $w_i$; $g(w) = (g_1(w), \ldots, g_n(w))$ defines the gradient
$\nabla E(w)$ of the objective function $E$ at $w$, while $H = [H_{ij}]$ defines the Hessian
$\nabla^2 E(w)$ of $E$ at $w$; $\bar{A}$ denotes the closure of the set $A$ and $E(w_1, \ldots, w_{i-1}, \cdot, w_{i+1},
\ldots, w_n)$ defines the error function obtained by holding $w_1, \ldots, w_{i-1}, w_{i+1}, \ldots, w_n$
fixed.

The problem of training is treated as an optimization problem in the FNN’s weight
space (i.e., $n$-dimensional Euclidean space). In other words, we want to find the
proper weights that satisfy the following system of equations:

$$
g_i(w) = 0, \qquad i = 1, \ldots, n. \qquad (2)
$$

In order to solve this system iteratively we want a sequence of weight vectors
$\{w^p\}$, $p = 0, 1, \ldots$ which converges to the point $w^* = (w_1^*, \ldots, w_n^*) \in \mathcal{D} \subset \mathbb{R}^n$
of the function $E$. First, we consider the sets $B_i$ to be those connected components
of $g_i^{-1}(0)$ containing $w^*$ on which $\partial_n g_i \neq 0$, for $i = 1, \ldots, n$ respectively.
Next, applying the Implicit Function Theorem (see [4, 3]) for each one
of the components $g_i$ we can find open neighborhoods $A_1^* \subset \mathbb{R}^{n-1}$ and $A_{2,i}^* \subset
\mathbb{R}$ of the points $y^* = (w_1^*, \ldots, w_{n-1}^*)$ and $w_n^*$ respectively, such that for any
$y = (w_1, \ldots, w_{n-1}) \in \bar{A}_1^*$ there exist unique mappings $\varphi_i$ defined and continuous
in $A_1^*$ such that $w_n = \varphi_i(y) \in \bar{A}_{2,i}^*$, and $g_i(y; \varphi_i(y)) = 0$, $i = 1, \ldots, n$.
Moreover, the partial derivatives $\partial_j \varphi_i$, $j = 1, \ldots, n-1$, exist in $A_1^*$ for each $\varphi_i$,
they are continuous in $\bar{A}_1^*$ and they are given by:

$$
\partial_j \varphi_i(y) = -\frac{\partial_j g_i(y; \varphi_i(y))}{\partial_n g_i(y; \varphi_i(y))}, \qquad i = 1, \ldots, n, \quad j = 1, \ldots, n-1. \qquad (3)
$$
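Relation (3) is the usual implicit-function derivative and can be verified numerically. The following minimal sketch takes $n = 2$ and a hypothetical smooth function `g` in place of a gradient component $g_i$; the implicit function is obtained by plain bisection, and its finite-difference slope is compared with the ratio of partial derivatives.

```python
def g(w1, w2):
    # Hypothetical smooth function playing the role of g_i(y; w_n) with n = 2.
    return w1 ** 2 + w2 ** 3 + w2 - 3.0

def phi(w1, lo=-10.0, hi=10.0):
    # Implicit function w2 = phi(w1) solving g(w1, w2) = 0 by plain bisection;
    # g is strictly increasing in w2, so the root in [lo, hi] is unique.
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if g(w1, mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

w1, h = 0.7, 1e-5
w2 = phi(w1)
numeric = (phi(w1 + h) - phi(w1 - h)) / (2.0 * h)   # finite-difference slope of phi
formula = -(2.0 * w1) / (3.0 * w2 ** 2 + 1.0)       # -d_1 g / d_2 g, as in (3)
```

The two slopes agree to within the finite-difference error, provided the partial derivative with respect to the last variable does not vanish, which is the condition imposed on the sets considered above.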


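Working exactly as in [3], we utilize Taylor’s formula to expand $\varphi_i(y)$ about $y^p$.
By straightforward calculations, utilizing approximate values for $g_i(\cdot)$ and $\partial_j g_i(\cdot) =
\partial^2_{ij} E$ (see [5], where error estimates for these approximations can also be found), we
obtain the following iterative scheme for the computation of the $n - 1$ components
of $w^*$:

$$
y^{p+1} = y^{p} + A_p^{-1} V_p, \qquad p = 0, 1, \ldots, \qquad (4)
$$

where $y^p = [w_i^p]$, $V_p = [v_i] = [w_n^{p,i} - w_n^{p,n}]$ and the elements of the matrix $A_p$ are:

$$
[a_{ij}] = \left[\frac{g_i(y^p + h e_j; w_n^{p,i}) - g_i(y^p; w_n^{p,i})}{g_i(y^p; w_n^{p,i} + h e_n) - g_i(y^p; w_n^{p,i})} - \frac{g_n(y^p + h e_j; w_n^{p,n}) - g_n(y^p; w_n^{p,n})}{g_n(y^p; w_n^{p,n} + h e_n) - g_n(y^p; w_n^{p,n})}\right], \qquad (5)
$$

with $w_n^{p,i} = \varphi_i(y^p)$, $h$ a small quantity and $e_j$ the $j$-th unit vector. After a
desired number of iterations of (4), say $p = m$, the $n$th component of $w^*$ can be
approximated by means of the following relation:

$$
w_n^{m+1} = w_n^{m,n} - \sum_{j=1}^{n-1} \bigl(w_j^{m+1} - w_j^{m}\bigr)\,\frac{g_n(y^m + h e_j; w_n^{m,n}) - g_n(y^m; w_n^{m,n})}{g_n(y^m; w_n^{m,n} + h e_n) - g_n(y^m; w_n^{m,n})}. \qquad (6)
$$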

Note that the iterative formula (4) uses the matrices $A_p$ and $V_p$. The matrix $A_p$
constitutes the reduced Hessian of our network; its components incorporate
components of the Hessian, but are evaluated at different points. The matrix $V_p$
uses only the points $w_n^{p,i}$ ($i = 1, \ldots, n - 1$) and $w_n^{p,n}$ instead of the gradient values
employed in Newton’s method. A proof of the convergence of (4) and (6) can be
found in [6].

Analogous procedures for obtaining $w^*$ can be constructed by replacing $w_n$ with any
one of the components $w_1, \ldots, w_{n-1}$, for example $w_{int}$. The above described method
does not require the expressions $\varphi_i$ but only the values $w_n^{p,i}$, which are given by the
solution of the one-dimensional equations $g_i(w_1^p, \ldots, w_{n-1}^p, \cdot) = 0$. So, by holding
$y^p = (w_1^p, \ldots, w_{n-1}^p)$ fixed, we can solve the equations $g_i(y^p; r_i^p) = 0$, $i = 1, \ldots, n$,
for an approximate solution $r_i^p$ in the interval $(a, b)$ with an accuracy $\delta$. In order
to solve the one-dimensional equations, we employ a modified bisection method
described in [3, 12] and given by the following formula:

$$
w^{p+1} = w^{p} + \operatorname{sgn}\psi(w^{p})\, q / 2^{p+1}, \qquad p = 0, 1, \ldots, \qquad (7)
$$

with $w^0 = a$, $q = \operatorname{sgn}\psi(a)\,(b - a)$ and where sgn denotes the well known sign function.
This method computes with certainty a root when $\operatorname{sgn}\psi(w^0)\,\operatorname{sgn}\psi(w^p) = -1$ (see
[12]). It is evident from (7) that the only computable information required by this
method is the algebraic signs of the function $\psi$.

                    FR          PR          BFGS           DRTM
Initial point     IT   FE     IT   FE     IT    FE      IT   FE   ASG
(0.3, 0.4)        F    F      F    F      F     F       5    20   100
(-1, -2)          F    F      F    F      14    274     7    28   140
(-1, 10)          F    F      F    F      14    285     7    28   140
(0.2, 0.2)        F    F      F    F      F     F       5    20   100
(2, 1)            F    F      F    F      13    298     5    20   100
(0.3, 0.3)        F    F      F    F      F     F       5    20   100
(-1.2, 1.2)       F    F      F    F      F     F       7    28   140

Table 1 Comparative results for Example 1.

A  high-level  description  of  the  new  algorithm  can  be  found  in  [8] . 
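The scheme (4)-(7) can be sketched for the smallest case $n = 2$, where $A_p$ is a single number. The sketch below is only an illustration under stated assumptions: the error function behind `g1` and `g2` is a hypothetical smooth test function, the bracket $(-10, 10)$ and the step $h$ are arbitrary choices, and the function names are not taken from [8].

```python
def sign(v):
    return (v > 0) - (v < 0)

def sign_bisection(psi, a, b, steps=60):
    """Modified bisection (7): only the algebraic signs of psi are used."""
    q = sign(psi(a)) * (b - a)
    w = a
    for p in range(steps):
        w = w + sign(psi(w)) * q / 2 ** (p + 1)
    return w

# Gradient components of the hypothetical error
# E(w1, w2) = (w1 - 1)^2 + 0.1 w1^4 + (w2 - 2)^2 + 0.5 w1 w2.
def g1(w1, w2):
    return 2.0 * (w1 - 1.0) + 0.4 * w1 ** 3 + 0.5 * w2

def g2(w1, w2):
    return 2.0 * (w2 - 2.0) + 0.5 * w1

def drtm_2d(w1, iterations=8, h=1e-6, bracket=(-10.0, 10.0)):
    w2 = None
    for _ in range(iterations):
        # Roots of g_i(w1, .) = 0 with w1 held fixed (the values w_n^{p,i}).
        r1 = sign_bisection(lambda r: g1(w1, r), *bracket)
        r2 = sign_bisection(lambda r: g2(w1, r), *bracket)
        # Ratios of finite differences as in (5); A_p is 1 x 1 when n = 2.
        ratio1 = (g1(w1 + h, r1) - g1(w1, r1)) / (g1(w1, r1 + h) - g1(w1, r1))
        ratio2 = (g2(w1 + h, r2) - g2(w1, r2)) / (g2(w1, r2 + h) - g2(w1, r2))
        step = (r1 - r2) / (ratio1 - ratio2)   # A_p^{-1} V_p in (4)
        w2 = r2 - step * ratio2                # relation (6) for the remaining weight
        w1 = w1 + step
    return w1, w2

w1_star, w2_star = drtm_2d(0.0)
```

Both gradient components vanish at the returned point to within rounding error. Note that the one-dimensional solver never uses the magnitude of the gradient, only its sign, which is the property the method exploits.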

3  Simulation  Results 

Here  we  present  and  compare  the  behavior  of  the  DRTM  with  other  popular  meth¬ 
ods  on  some  artificially  created  but  characteristic  situations.  For  example,  it  is 
common  in  FNN  training  to  take  minimization  steps  that  increase  some  weights  by 
large  amounts  pushing  the  output  of  the  neuron  into  saturation.  Moreover,  in  vari¬ 
ous  small  and  large  scale  neural  network  applications  the  error  surface  has  flat  and 
steep  regions.  It  is  well  known  that  the  BP  is  highly  inefficient  in  locating  minima 
in  such  surfaces.  In  the  following  examples,  the  gradient  is  evaluated  using  finite 
differences  for  the  DRTM  and  analytically  for  all  the  other  methods. 

Example 1 The objective function’s surface has flat and steep regions.

$$
E(w) = \sum_{i=1}^{10} f_i^{2}(w), \qquad f_i(w_1, w_2) = 2 + 2i - \bigl(e^{i w_1} + e^{i w_2}\bigr). \qquad (8)
$$
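The objective (8) is straightforward to evaluate. The minimal sketch below (the function name is hypothetical) computes it at the point $w_1 = w_2 = 0.2578$, which the text identifies as the global minimum, and at two nearby starting points taken from Table 1.

```python
import math

def jennrich_sampson(w1, w2, terms=10):
    """Sum of squares of f_i(w1, w2) = 2 + 2i - (exp(i*w1) + exp(i*w2)), i = 1..terms."""
    return sum(
        (2 + 2 * i - (math.exp(i * w1) + math.exp(i * w2))) ** 2
        for i in range(1, terms + 1)
    )

e_min = jennrich_sampson(0.2578, 0.2578)    # roughly 124.36
e_near_a = jennrich_sampson(0.3, 0.3)       # starting point from Table 1
e_near_b = jennrich_sampson(0.2, 0.2)       # starting point from Table 1
```

The exponential terms make the surface rise very steeply for larger weights and flatten out for smaller ones, which is what makes the problem difficult for gradient-based line searches.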

System (8), which is a well-known test case (the Jennrich and Sampson function; see
[9]), has a global minimum at $w_1 = w_2 = 0.2578$. In Table 1 we present results
obtained by applying the nonlinear conjugate gradient methods Fletcher-Reeves
(FR) and Polak-Ribiere (PR) and the quasi-Newton Broyden-Fletcher-Goldfarb-
Shanno (BFGS) method, together with the corresponding numerical results of DRTM.
In this table, IT indicates the total number of iterations required to obtain $w^*$ (the
iteration limit is 500); FE the total number of function evaluations (and derivatives);
and ASG the total number of algebraic signs of the components of the gradient that
are required for applying the iterative scheme (7). Because of the difficulty of the
problem, FR and PR failed to converge in all cases (marked with an F in the
table). The results with the BFGS method are mixed. In particular, when the starting
point is close to the minimum, BFGS leaves the appropriate region and moves in the
wrong direction in its attempt to minimize the objective function.

              BP                        DRTM
      MN      STD     SUC       MN     STD    SUC     MAS
      100.4   77.3    38.5      2.3    1.2    72.5    67.8

Table 2 Comparison of Back-propagation with DRTM for Example 2.

Example 2 The objective function’s surface is oval shaped and bent.

We can artificially create such a surface by training a single neuron with sigmoid
non-linearity using the patterns {-6, 1}, {-6.1, 1}, {-4.1, 1}, {-4, 1}, {4, 1},
{4.1, 1}, {6, 1}, {6.1, 1} for input and {0}, {0}, {0.97}, {0.99}, {0.01}, {0.03}, {1},
{1} for output. The weights $w_1, w_2$ take values in the region [-3, 3] x [-7.5, 7.5].
The global minimum is located at the center of the surface and there are two valleys
that lead to local minima. The step size for BP was 0.05. The initial weights were
formed by spanning the interval [-3, 3] in steps of 0.05 and the interval [-7.5, 7.5]
in steps of 0.125.
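The error surface of this example can be tabulated directly. The sketch below assumes that the error is the plain sum of squared output errors and that the second input component, which is always 1, acts as a bias input multiplied by $w_2$; the array and function names are hypothetical.

```python
import numpy as np

inputs = np.array([-6.0, -6.1, -4.1, -4.0, 4.0, 4.1, 6.0, 6.1])   # first input component
targets = np.array([0.0, 0.0, 0.97, 0.99, 0.01, 0.03, 1.0, 1.0])

def error(w1, w2):
    """Sum of squared errors of one sigmoid neuron; w2 multiplies the constant input 1."""
    out = 1.0 / (1.0 + np.exp(-(w1 * inputs + w2)))
    return float(np.sum((targets - out) ** 2))

w1_grid = np.linspace(-3.0, 3.0, 121)      # steps of 0.05
w2_grid = np.linspace(-7.5, 7.5, 121)      # steps of 0.125
surface = np.array([[error(a, b) for b in w2_grid] for a in w1_grid])
num_starts = surface.size                  # number of initial weight vectors on the grid
```

The two grids together give 121 x 121 = 14641 initial weight vectors, so the success percentages reported for this example are averages over a dense covering of the weight region.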

The behavior of the methods is exhibited in Table 2, where MN indicates the mean
number of iterations for simulations that reached the global minimum; STD the
standard deviation of iterations; SUC the percentage of success in locating the
global minimum; and MAS the mean number of algebraic signs that are required
for applying the iterative scheme (7). Note that for DRTM, since finite differences
are used, two error function evaluations are required in each iteration. BP succeeds
in locating the global minimum when the initial weights take values in the intervals
$w_1 \in [-0.8, 1.5]$ and $w_2 \in [-2.5, 2.5]$. On the other hand, DRTM is less affected
by the initial weights. In this case we exploit the fact that we are able to isolate
the weight vector component most responsible for unstable behavior by reducing
the dimension of the problem. Therefore, DRTM is very fast and achieves a high
percentage of success.

4  Conclusion  and  Further  Improvements 

This paper describes a new training method for FNNs. Although the proposed
method uses reduction to simpler one-dimensional equations, it converges
quadratically to $n - 1$ components of an optimal weight vector, while the remaining
weight is evaluated separately using the final approximations of the others. Thus,
it does not require a good initial estimate for one component of an optimal weight
vector. Moreover, it is at the user’s disposal to choose which will be the remaining
weight, according to the problem. Since it uses the modified one-dimensional
bisection method, it requires only that the algebraic signs of the function and gradient
values be correct. It is also possible to use this method in training with blocks of
weights using different remaining weights. In this case, the method can lead to a
network training and construction algorithm. This issue is currently under
development and we hope to address it in a future communication.

Note  that  in  general  the  matrix  of  our  reduced  system  is  not  symmetric.  It  is 
possible  to  transform  it  to  a  symmetric  one  by  using  proper  perturbations  [6].  If 
the  matrix  is  symmetric  and  positive  definite  the  optimal  weight  vector  minimizes 
the  objective  function.  Furthermore,  DRTM  appears  particularly  useful  when  it  is 
difficult  to  evaluate  the  gradient  values  accurately,  as  well  as  when  the  Hessian  at 
the optimum is singular or ill-conditioned [8].

In this contribution a new training method is proposed for neural networks that are based on
neurons  whose  output  can  be  in  a  particular  state.  This  method  minimises  the  well  known  least 
square  criterion  by  using  information  concerning  only  the  signs  of  the  error  function  and  inaccurate 
gradient  values.  The  algorithm  is  based  on  a  modified  one-dimensional  bisection  method  and  it 
treats  supervised  training  in  networks  of  neurons  with  discrete  output  states  as  a  problem  of 
minimisation based on imprecise values.

1 Introduction


Consider a Discrete Multilayer Neural Network (DMNN) consisting of $L$ layers,
in which the first layer denotes the input, the last one, $L$, is the output, and the
intermediate layers are the hidden layers. It is assumed that the $(l-1)$-th layer has
$N_{l-1}$ units. These units operate according to the following equations:

$$net_j^l = \sum_{i=1}^{N_{l-1}} w_{ij}^{l-1,l}\, y_i^{l-1} + \theta_j^l, \qquad y_j^l = \sigma^l(net_j^l), \qquad (1)$$

where $net_j^l$ is the net input to the $j$th unit at the $l$th layer, $w_{ij}^{l-1,l}$ is the connection
weight from the $i$th unit at the $(l-1)$-th layer to the $j$th unit at the $l$th layer, $y_i^l$
denotes the output of the $i$th unit belonging to the $l$th layer, $\theta_j^l$ denotes the threshold
of the $j$th unit at the $l$th layer, and $\sigma$ is the activation function. In this paper we
consider units where $\sigma(net_j^l)$ is a discrete activation function. We especially focus
on units with two output states, usually called binary or hard-limiting units [1],
i.e. $\sigma(net_j^l)$ = “true” if $net_j^l > 0$, and “false” otherwise.
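
As an illustration of the unit equations with hard-limiting activations, the following is a minimal Python sketch of a forward pass through such a network. It assumes that the threshold is added to the weighted sum and that “true” and “false” are coded as 1 and 0. The function names and the hand-picked 2-2-1 weights that realise XOR are hypothetical and are not taken from the paper.

```python
def hard_limiter(net, true_value=1.0, false_value=0.0):
    """Discrete activation: 'true' if the net input is positive, 'false' otherwise."""
    return true_value if net > 0 else false_value


def dmnn_forward(x, weights, thresholds, true_value=1.0, false_value=0.0):
    """Propagate the input x through successive layers of hard-limiting units.

    weights[l][j][i] connects unit i of the previous layer to unit j of layer l;
    thresholds[l][j] is added to the net input of unit j (an assumed sign convention).
    """
    y = list(x)
    for layer_weights, layer_thresholds in zip(weights, thresholds):
        net = [
            sum(w * y_prev for w, y_prev in zip(row, y)) + theta
            for row, theta in zip(layer_weights, layer_thresholds)
        ]
        y = [hard_limiter(n, true_value, false_value) for n in net]
    return y


# Hypothetical 2-2-1 network: hidden unit 1 acts as OR, hidden unit 2 as AND,
# and the output unit fires when OR is true and AND is false (that is, XOR).
XOR_WEIGHTS = [[[1.0, 1.0], [1.0, 1.0]], [[1.0, -1.0]]]
XOR_THRESHOLDS = [[-0.5, -1.5], [-0.5]]

for a in (0.0, 1.0):
    for b in (0.0, 1.0):
        print(a, b, dmnn_forward([a, b], XOR_WEIGHTS, XOR_THRESHOLDS)[0])
```

Because every unit only emits one of two values, the internal state of the network for each input pattern is directly readable, which is the interpretability property discussed in this section.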

Although units with discrete activation functions have been superseded to a large
extent by the computationally more powerful units with analog activation functions,
DMNNs remain important in that they can handle many of the inherently binary
tasks that neural networks are used for. Their internal representations are clearly
interpretable, they are computationally simpler to understand than networks with
sigmoid units and they provide a starting point for the study of neural network
properties. Furthermore, when using hard-limiting units we can better understand
the relationship between the size of the network and the complexity of the training
[2]. In [3] it has been demonstrated that DMNNs with only one hidden layer can
create any decision region that can be expressed as a finite union of polyhedral sets
when there is one unit in the input layer. Moreover, artificially created examples
were given where these networks create non-convex and disjoint decision regions.
Finally, discrete activation functions facilitate neural network implementations in
digital hardware and are much less costly to fabricate.

The most common feed-forward neural network (FNN) training algorithm,
back-propagation (BP) [4], which makes use of gradient descent, cannot be applied
directly to networks of units with discrete output states, since discrete activation
functions (such as hard-limiters) are non-differentiable. However, various
modifications of gradient descent have been presented [5, 6, 7]. In [8] an approximation
to gradient descent, the so-called pseudo-gradient training method, was
proposed. This method uses the gradient of a sigmoid as a heuristic hint instead of
the true gradient. Experimental results validated the effectiveness of this approach.
In this paper, we derive and apply a new training method for DMNNs that makes
use of the gradient approximation introduced in [8]. Our method exploits the imprecise
information regarding the error function and the approximated gradient, as
the pseudo-gradient method does, but it has an improved convergence speed and
has the potential to train DMNNs in situations where, according to our experiments,
the pseudo-gradient method fails to converge.

2  Problem  Formulation  and  Proposed  Solution 

We consider units with two discrete output states and we shall use the convention
$f$ (or $-f$) for “false” and $t$ (or $+t$) for “true”, where $f$, $t$ are real positive numbers
and $f < t$, instead of the classical 0 and 1 (or $-1$ and $+1$). Real positive values
prevent units from saturating, give the logic “false” some power of influence
over the next layer of the DMNN, and help the justification of the approximated
gradient value which we shall employ.

First, let us define the error for a discrete unit as follows: $e_j(t) = d_j(t) - y_j^L(t)$,
for $j = 1, 2, \ldots, N_L$, where $d_j(t)$ is the desired response at the $j$th neuron of the
output layer at the input pattern $t$, and $y_j^L(t)$ is the output at the $j$th neuron of the
output layer $L$. For a fixed, finite set of input-output cases, the square error over
the training set which contains $T$ representative cases is:

$$E = \sum_{t=1}^{T} E(t) = \sum_{t=1}^{T} \sum_{j=1}^{N_L} e_j^2(t). \qquad (2)$$

The  idea  of  the  pseudo-gradient  was  first  introduced  in  training  discrete  recurrent 
neural  networks  [9,  10]  and  extended  to  DMNNs  [8].  The  method  approximates 
the true gradient of the error function with respect to the weights, i.e. $\nabla E(w)$, by
introducing an analog set of values for the outputs of the hidden layer units and
the output layer units.

Thus, it is assumed that $y_j$ in (1) can be written as $y_j = \sigma(s(net_j))$, where
$\sigma(x)$ = “true” if $x > 0.5$, and “false” otherwise, if $s(\cdot)$ is defined in $[0, 1]$. If $s(\cdot)$ is
defined in $[-1, 1]$ then $\sigma(x)$ = “true” if $x > 0$, and “false” otherwise.

Using the chain rule, the pseudo-gradient is computed:

$$\delta_j^L = (d_j - y_j^L)\, s'(net_j^L). \qquad (3)$$

Here $s'(net_j)$ is the derivative of the analog activation function.

By using real positive values for “true” and “false” we ensure that the pseudo-gradient
will not reduce to zero when the output is “false”. Note also that we do
not use $\sigma'$, which is zero everywhere except at zero, where it does not exist. Instead, we use $s'$,
which is always positive, so $\delta_j$ gives an indication of the direction and magnitude
of a step up or down as a function of $net_j$ in the error surface $E$.
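
The role of $s'$ can be made concrete with a short Python sketch of the output-layer pseudo-gradient term. It assumes a logistic sigmoid for the analog function $s$, a decision level of 0.5, and the output states 0.8 for “true” and 0.2 for “false” that are used for the XOR experiment later in the paper. The function names are hypothetical.

```python
import math


def sigmoid(net):
    """Analog activation s(net), assumed here to be the logistic function on [0, 1]."""
    return 1.0 / (1.0 + math.exp(-net))


def sigmoid_prime(net):
    """Derivative s'(net); strictly positive for every finite net input."""
    s = sigmoid(net)
    return s * (1.0 - s)


def discrete_output(net, t=0.8, f=0.2):
    """Discrete unit output: t ('true') if s(net) > 0.5, f ('false') otherwise."""
    return t if sigmoid(net) > 0.5 else f


def output_delta(d, net, t=0.8, f=0.2):
    """Pseudo-gradient term (d - y) * s'(net) for one output unit."""
    y = discrete_output(net, t, f)
    return (d - y) * sigmoid_prime(net)


# The unit outputs 'false' but the target is 'true': the term is positive,
# indicating that the net input should be increased.
print(output_delta(d=0.8, net=-1.0))
# The unit outputs 'true' but the target is 'false': the term is negative.
print(output_delta(d=0.2, net=1.0))
```

The sign of the term depends only on the sign of the discrete error, while $s'$ scales it, so the term is largest for net inputs near the decision boundary.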

However, as pointed out in [8], the value of the pseudo-gradient is not accurate
enough, so gradient descent based training in DMNNs is considerably slower
than BP training in FNNs.

In order to alleviate this problem we propose an alternative to the pseudo-gradient
training method. To be more specific, we propose to solve the one-dimensional
equation:

$$E(w_1, w_2^0, \ldots, w_n^0) - E(w_1^0, w_2^0, \ldots, w_n^0) = 0,$$

for $w_1$, keeping all other components of the weight vector at their constant values.
Now, if $\hat{w}_1$ is the solution of the above equation, then the point defined by the
vector $(\hat{w}_1, w_2^0, \ldots, w_n^0)$ possesses the same error function value as the point $w^0$,
so it belongs to the same contour line as $w^0$. Assuming that the error function
curves up from $w^*$ in all directions, we can claim that any point which belongs to
the line segment with endpoints $w^0$ and $(\hat{w}_1, w_2^0, \ldots, w_n^0)$ possesses a smaller error function
value than these endpoints. With this fact in mind we can now choose such a point,
say, for example $w_1^1 = w_1^0 + \gamma(\hat{w}_1 - w_1^0)$, $\gamma \in (0, 1)$, and solve the one-dimensional
equation:

$$E(w_1^1, w_2, w_3^0, \ldots, w_n^0) - E(w_1^1, w_2^0, w_3^0, \ldots, w_n^0) = 0,$$

for $w_2$, keeping all other components at their constant values. If $\hat{w}_2$ is the solution
of this equation then we can obtain a better approximation for this component by
taking $w_2^1 = w_2^0 + \gamma(\hat{w}_2 - w_2^0)$, $\gamma \in (0, 1)$.

Continuing in a similar way with the remaining components of the weight vector
we obtain the new vector $w^1 = (w_1^1, \ldots, w_n^1)$ and replace the initial vector $w^0$ by
$w^1$. The procedure can then be repeated to compute $w^2$ and so on until the final
estimated point is computed according to a predetermined accuracy. So, in general
we want to find the parameter $x_i$ (a weight or threshold) that satisfies:

$$E(x_1^{k+1}, \ldots, x_{i-1}^{k+1}, x_i, x_{i+1}^k, \ldots, x_n^k) - E(x_1^{k+1}, \ldots, x_{i-1}^{k+1}, x_i^k, x_{i+1}^k, \ldots, x_n^k) = 0,$$

by applying the modified bisection (see [12, 13]) in the interval $(a_i, b_i)$ within accuracy $d$:

$$x_i^{p+1} = x_i^p + C\,\mathrm{sgn}\big(E(z^p) - E(z^0)\big)/2^{p+1}, \qquad p = 0, 1, \ldots, \lceil \log_2((b_i - a_i)\,d^{-1}) \rceil,$$

where the notation $\lceil \cdot \rceil$ refers to the smallest integer not less than the real number
quoted and $x_i^0 = a_i$, $z^p = (x_1^{k+1}, \ldots, x_{i-1}^{k+1}, x_i^p, x_{i+1}^k, \ldots, x_n^k)$,
$C = \mathrm{sgn}\,E(z^0)(b_i - a_i)$, $a_i = x_i^k - \frac{1}{2}\{1 + \mathrm{sgn}\,\partial_i E(x^k)\}h_i$, $b_i = a_i + h_i$. If an
iteration of the algorithm fails we switch to the pseudo-gradient training method.
So,  the  justification  of  the  new  procedure  is  based  on  the  heuristic  justification  of  the 
pseudo-gradient  which  can  be  found  in  any  one  of  [8,  9,  10].  A  formal  justification 
of  the  proposed  procedure  in  case  of  differentiable  objective  functions  can  be  found 
in  [11]. 
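
The procedure of this section can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it brackets the second crossing of the current error level along one coordinate in the descent direction (an interval of length `h` chosen from the sign of the gradient component), locates it by a plain bisection that uses only the sign of the error difference, and then relaxes the coordinate with the factor `gamma`. All function names and the quadratic test function are hypothetical, and a coordinate is simply skipped when no crossing is bracketed, where the text switches to the pseudo-gradient method instead.

```python
import math


def sgn(v):
    """Algebraic sign of v as -1, 0 or +1."""
    return (v > 0) - (v < 0)


def bisect_by_sign(f, near, far, accuracy):
    """Locate a zero of f between near and far using only the sign of f.

    Assumes f is negative just beyond `near` and positive at `far`.
    """
    lo, hi = near, far  # lo stays on the negative side, hi on the positive side
    steps = max(1, math.ceil(math.log2(abs(far - near) / accuracy)))
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        if sgn(f(mid)) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def sweep(E, grad, w, h=10.0, gamma=0.5, accuracy=1e-8):
    """One pass over all coordinates of w, updating each in turn."""
    w = list(w)
    for i in range(len(w)):
        reference = E(w)
        s = sgn(grad(w)[i])
        if s == 0:
            continue  # no descent information for this coordinate
        far = w[i] - s * h  # far end of the interval, in the descent direction

        def f(x):
            z = list(w)
            z[i] = x
            return E(z) - reference

        if sgn(f(far)) <= 0:
            continue  # the contour crossing is not bracketed within h
        x_hat = bisect_by_sign(f, w[i], far, accuracy)
        w[i] += gamma * (x_hat - w[i])
    return w


def quadratic(w):
    """Hypothetical smooth error surface with its minimum at (1, -2)."""
    return (w[0] - 1.0) ** 2 + 2.0 * (w[1] + 2.0) ** 2


def quadratic_grad(w):
    return [2.0 * (w[0] - 1.0), 4.0 * (w[1] + 2.0)]


result = sweep(quadratic, quadratic_grad, [4.0, 2.0])
print(result)
```

For a quadratic error surface and `gamma = 0.5` the relaxed point is the midpoint of the two contour crossings, which is the exact minimiser along that coordinate; for other surfaces it is only an improved estimate and the sweep has to be repeated.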
| Problem | BP: MN | BP: STD | BP: MNE | New method: MN | New method: STD | New method: MNE | New method: MAS | New method: SAS |
|---|---|---|---|---|---|---|---|---|
| a) | 561 | 550.4 | 0.0396 | 40.6 | 4.2 | 0.0000008 | 239.6 | 54.12 |
| b) | 18121 | 3048.7 | 0.49 | 28.5 | 13.43 | 0.45 | 20673 | 9310.9 |

Table 1 Experimental results: a) XOR, b) $\sin x \cos 2x$.


3  Experimental  Results 

Here we present and compare the behaviour of the new training method with
BP [4] and the pseudo-gradient training method [8] for the XOR problem and for
training a 1-10-1 network to approximate the function $\sin x \cos 2x$ (Table 1). In all
problems $\gamma = 0.5$, d = 10"^^, $h = 10$ and no pseudo-gradient subprocedure has been
applied with the proposed method, in order to give a fairer evaluation. MN indicates
the mean number of iterations; STD the standard deviation of iterations; MNE the
mean value of the error; MAS the mean number of algebraic signs required for the
bisection scheme and SAS the standard deviation of the required algebraic signs.
The results are for 10 simulation runs, for the same initial weights; the maximum
number of iterations was set to 2000, the weights were initialised in the interval
$[-10, 10]$ and the step size for BP was set to the standard value 0.75. For the XOR
the thresholds were set as follows: “true” = 0.8 and “false” = 0.2. Under the
same conditions the pseudo-gradient training needed more than 2000 iterations to
converge. The frequency with which the algorithm became trapped in local minima
seems to be about the same as for BP for binary tasks. We also used the new
method in training DMNNs to learn smooth functions. One hidden layer of hard-limiting
units and one output unit with a linear activation function were used in all
our experiments. We did not manage to train DMNNs using the pseudo-gradient
training method due to oscillations, although various step sizes and different discrete
activation functions were tried. With the new algorithm and discrete activation
functions such as 0.5 for “true” and $-0.5$ for “false”, DMNNs were trained as fast
as, and often faster than, BP trained FNNs until $E < 0.5$ (over 21 input/output
cases). After this error bound, the convergence speed was reduced due to saturation
problems.

However,  it  is  worth  noticing  the  difference  in  the  behaviour  between  BP  and  the 
new  method.  Back-propagation  trained  FNNs  exhibit  a  greater  tendency  to  fit 
closely  data  with  higher  variation  than  data  with  low  variation.  On  the  other  hand, 
although  DMNNs  do  not  produce  smooth  functions,  they  learn  the  general  trend  of 
the  data  values  and  therefore  might  be  more  useful  than  FNNs  when  there  is  noise 
in  the  data  and  the  error  goal  can  be  set  so  high  that  the  network  does  not  have 
to  fit  all  the  target  values  perfectly.  Situations  like  this  usually  occur  in  system 
identification  and  control  (see  [14]). 

4  Conclusion  and  Further  Improvements 

This paper describes a new training method for DMNNs. The method does not
directly perform gradient evaluations. Since it uses the modified one-dimensional
bisection method it requires only that the algebraic signs of the function and gradient
values be correct; so it can be applied to problems with imprecise function
and gradient values. The method can also be used in training with blocks of network
parameters, for example training the entire network, then the weights to the output
layer and the thresholds of the hidden units, etc. We have tested such configurations
and the results were very promising, providing faster training.

A methodology for investigating the invariant structural characteristics of the different approximations
produced by Hopfield networks is presented. The technique exploits the description of
the dynamics of a network as a Generating series which relates the output of a network to the
past history of inputs. Truncations of a Hopfield Generating series are approximations to unknown
dynamics to a specified order. As a truncated series has finite Lie rank, a local minimal realisation
can be formulated. This realisation has a dimension whose lower bound is determined by the
relative order of the network and whose upper bound is determined by the order of truncation.
The maximal dimension of the minimal realisation is independent of the number of nodes in the
network.

Keywords: Hopfield networks, nonlinear dynamics, realisations, minimality.

1  Introduction 

Trained  recurrent  networks  are  commonly  used  to  provide  models  of  an  unknown 
nonlinear  dynamic  system.  The  representations  are  in  the  form  of  state-space  models 
which  are  usually  characterised  as  sets  of  nonlinear  differential  equations.  However, 
different  combinations  of  network  weight  parameters  often  produce  comparable 
approximation  capabilities.  Thus  a  fundamental  problem  is  the  interpretation  of 
these  representations.  One  approach  to  this  problem  is  to  attempt  to  reduce  the 
state-space  model  to  a  minimal  form  as  this  is  the  description  to  which  all  other 
representations are related by diffeomorphisms.

In practice, network model building is concerned with producing a suitable approximation
of the unknown dynamic system; thus what is required is a method
for specifying the order of the approximation and for producing the corresponding
minimal realisation which has the same approximation capability. The input-output
behaviour of a trained network can be formulated as a formal power series in non-commutative
variables. This formal power series, the Generating series [2], has a
minimal realisation if its Lie rank is finite [1]. If the infinite Hopfield Generating
series, which in general is not of finite Lie rank, is truncated, it can be used to
produce minimal realisations whose input-output behaviour matches that of the network
up to a specified arbitrary order. This is equivalent to producing the minimal
realisation of the unknown dynamics to a specified order of approximation.

In this article the tools for constructing minimal realisations of truncated Generating
series are applied to Hopfield recurrent networks. These tools were originally
described by Fliess [1]. The approach was further developed by Jacob & Oussous
[3].

2  Hopfield  Networks  and  their  Local  Solutions 

The Lie derivatives, $A_0$ and $A_1$, that define the state-space of the single-input
($m = 1$) single-output (SISO) Hopfield RNN are

$$A_0 = \sum_{i}^{N} \Big( -\kappa_i x_i + \sum_{j}^{N} w_{ij}\, \sigma(x_j) \Big) \frac{\partial}{\partial x_i}, \qquad A_1 = \sum_{i}^{N} \gamma_i \frac{\partial}{\partial x_i},$$

where $\sigma(x_i)$ is the output of the $i$th node in a single layer of $N$ hidden nodes, $\kappa_i$
acts as a time constant of the $i$th node, $w_{ij}$ is the weight between node $i$ and $j$,
and $\gamma_i u$ is the weighted input $u$ into node $i$. It is assumed that the $\sigma$ nonlinearity
is the tanh function and that the output of the network is y = h(x) =

The  Generating  series  solution  of  this  specific  linear  analytic  system  is  formed  by 
the  Peano-Baker  iteration  of  the  state-space  differential  equations.  It  can  be  shown 
to  be  [2] 

$$y(t) = S = h|_{x_0} + \sum_{k \ge 1} \sum_{j_1, \ldots, j_k = 0}^{m} A_{j_k} \cdots A_{j_1}(h)|_{x_0}\; z_{j_1} \cdots z_{j_k},$$

where $S$ is a mapping between the free monoid $Z^*$, constructed from the alphabet
set $\{z_0, z_1\}$, into $\mathbb{R}$ and the subscript $|_{x_0}$ means evaluated at the initial conditions of
the state vector. Each word $z_{j_1} z_{j_2} \cdots z_{j_k}$ corresponds to the iterated integral

$$\int_0^t \int_0^{\tau_1} \cdots \int_0^{\tau_{k-1}} u_{j_1}(\tau_1) \cdots u_{j_k}(\tau_k)\, d\tau_k \cdots d\tau_2\, d\tau_1$$

defined recursively on its length ($u_0$ is defined to be unity). The coefficient of the
word $z_{j_1} z_{j_2} \cdots z_{j_k}$ is the iterated Lie derivative $A_{j_k} \cdots A_{j_1}(h)$ operating on the output $h$
and evaluated at the initial conditions. The Generating series is a causal functional
expansion about a point and is valid local to this position in state space, for a short
time and for small input $u$.
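
To make the correspondence between words and iterated integrals concrete, the following Python sketch evaluates the iterated integral of a word numerically with the trapezoidal rule. It assumes the convention that the leftmost letter of the word belongs to the outermost integral and that $u_0 \equiv 1$. The function name, the grid size and the test input $u_1(\tau) = \tau$ are hypothetical choices for illustration.

```python
def iterated_integral(word, inputs, t, steps=2000):
    """Numerically evaluate the iterated integral associated with a word.

    word   : tuple of letter indices (j_1, ..., j_k); the leftmost letter is
             taken to be the outermost integral (an assumed convention).
    inputs : list of functions, inputs[j](tau) = u_j(tau); inputs[0] should
             return 1.0 because u_0 is defined to be unity.
    """
    dt = t / steps
    grid = [n * dt for n in range(steps + 1)]
    inner = [1.0] * (steps + 1)  # the empty word integrates to 1
    for j in reversed(word):  # build from the innermost integral outwards
        integrand = [inputs[j](tau) * value for tau, value in zip(grid, inner)]
        running = [0.0]
        for n in range(steps):
            running.append(running[-1] + 0.5 * dt * (integrand[n] + integrand[n + 1]))
        inner = running
    return inner[-1]


def u0(tau):
    return 1.0


def u1(tau):
    return tau


# The two words of length two built from z_0 and z_1 give different values,
# which is why the series has to be written in non-commutative variables.
print(iterated_integral((0, 1), [u0, u1], 2.0))  # close to t**3 / 6
print(iterated_integral((1, 0), [u0, u1], 2.0))  # close to t**3 / 3
```

With $u_1(\tau) = \tau$ the two orderings integrate to $t^3/6$ and $t^3/3$ respectively, so swapping the letters of a word changes the functional it represents.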

Two systems are locally equivalent if and only if their Generating series match. In
this  framework  any  training  algorithm  for  a  recurrent  network  can  be  viewed  as  a 
method  for  adjusting  the  coefficients  of  words  to  produce  this  matching  by  altering 
the  contribution  of  the  weights,  to  each  term,  in  the  Generating  series. 

3  Network  Minimal  Realisations 

The  Generating  series  form  of  the  Hopfield  network  is  an  infinite  series.  Truncation 
of  this  series  produces  an  approximation  of  the  local  input-output  behaviour  of  the 
network  and  therefore  of  the  unknown  system. 

To  produce  the  minimal  realisation  of  this  approximation  depends  upon  identifying 
the  structural  characteristics  of  a  truncated  Generating  series  which  does  not  include 
a  constant  term.  This  latter  constraint  can  always  be  satisfied  by  considering  the 
dynamics  from  a  specific  position  in  state  space.  The  truncated  Generating  series 
of arbitrary order $k$ can be expressed in terms of a Lie-Hankel matrix [3]. A Lie-Hankel
matrix, $LH_S$, of a Generating series $S$ is an (infinite) array whose rows are
indexed by a totally ordered basis of the Lie algebra of $L(Z)$, and the columns
are indexed by $Z^*$. The finiteness of the rank of this matrix determines whether
a corresponding minimal state-space realisation exists, while the magnitude of the
rank governs the dimension of the realisation [1].

As the basis of $L(Z)$, the Lyndon basis (the specific Lie polynomials, $P_i$) can be
chosen [3, 4].

For example, if the Generating series of an $n$-state Hopfield network is truncated so
that the length of the words is $\le 2$, then the Lyndon words are the set $\{z_0, z_1, z_0 z_1\}$
and the corresponding Lie polynomials $P_i$ are $\{z_0, z_1, [z_0, z_1]\}$, where $[z_0, z_1]$ is the
Lie bracket of the words $z_0$, $z_1$ and is defined as $z_0 z_1 - z_1 z_0$. The Lie-Hankel matrix
is shown in Table 1 whose elements are evaluated at a point in the state space of
the Hopfield network. Similar analyses can be made for truncated Generating series


Manchanda  &  Green:  Local  Realisations  of  Networks 




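| | $e$ | $z_0$ | $z_1$ |
|---|---|---|---|
| $z_0$ | $A_0(h)$ | $A_0^2(h)$ | $A_1 A_0(h)$ |
| $z_1$ | $A_1(h)$ | $A_0 A_1(h)$ | $A_1^2(h)$ |
| $[z_0, z_1]$ | $A_1 A_0(h) - A_0 A_1(h)$ | | |

Table 1

| order | rank |
|---|---|
| 1 | 1 |
| 2 | 3 |
| 3 | 5 |
| 4 | 8 |

Table 2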


of  higher  order.  The  maximum  rank,  as  a  function  of  truncation  order  is  shown  in 
Table  2. 
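
Since the rows of the Lie-Hankel matrix are indexed by the Lyndon basis, the number of Lyndon words up to a given length bounds the number of rows of the truncated matrix. The following Python sketch enumerates these words over a two-letter alphabet using Duval's generation algorithm, a standard technique that is not described in the paper; the function name is hypothetical and the letters $z_0$, $z_1$ are coded as 0 and 1.

```python
def lyndon_words(alphabet_size, max_length):
    """Yield all Lyndon words of length <= max_length, in lexicographic order.

    Letters are coded as integers 0 .. alphabet_size - 1 (Duval's algorithm).
    """
    word = [-1]
    while word:
        word[-1] += 1  # increment the last letter
        yield tuple(word)
        period = len(word)
        while len(word) < max_length:  # extend periodically to full length
            word.append(word[-period])
        while word and word[-1] == alphabet_size - 1:
            word.pop()  # strip trailing maximal letters


for order in range(1, 5):
    words = list(lyndon_words(2, order))
    print(order, len(words), words)
```

For truncation orders 2, 3 and 4 the counts are 3, 5 and 8, which coincide with the maximal ranks listed in Table 2. For order 1 there are two Lyndon words, $z_0$ and $z_1$, but presumably only the column of the empty word remains, so the rank cannot exceed 1.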

The  rank,  and  therefore  the  dimension  of  the  minimal  realisation,  is  not  determined 
by  the  number  of  hidden  nodes  in  the  network  as  the  elements  of  the  Lie-Hankel 
matrix  are  defined  for  an  arbitrary  number  of  hidden  nodes.  The  rank  reduces  if 
certain  network  weights  result  in  either  linear  row  interdependence  or  rows  having 
zero  entries.  This  latter  condition  can  occur  if  the  networks  have  relative  order 
greater than unity as then particular terms in the Generating series are zero. The
lower  bound  on  the  rank  of  the  Lie-Hankel  matrix  is  determined  by  the  relative 
order  of  the  network. 

3.1  Construction  of  the  Minimal  Realisation 

Lyndon words can be used to construct a basis set of polynomials $Q_j$. The basis set
$Q$ represents the local coordinates of the minimal state-space realisation. The set
$Q$ is used to reconstruct the truncated Generating series $S$. The series is expressed
in terms of linear combinations and shuffle products of the elements $Q_j$ of the basis
set $Q$. Thus, for the truncated Hopfield network of relative order unity, the set $Q$
consists of the elements $Q_1 = z_0$, $Q_2 = z_1$ and $Q_3 = z_0 z_1$.

The truncated series is reconstructed from the basis set $Q$ by using a modification of
the algorithm proposed by Jacob & Oussous [3] to deal with Generating series with
arbitrary coefficients of words. The algorithm iteratively searches for and identifies
proper left factors of the truncated series.

The vector fields are reconstructed in terms of linear combinations and shuffle
products of the elements of $Q$ and translated into the state space coordinates $q$.
Thus, in the length two example, the vector fields of the minimal realisation are

$$A_0 = \frac{\partial}{\partial q_1}, \qquad A_1 = \frac{\partial}{\partial q_2} + q_1 \frac{\partial}{\partial q_3}.$$

The output function is given by

$$y = (A_0 A_1(h) - A_1 A_0(h))\,q_3 + \tfrac{1}{2} A_1^2(h)\,q_2^2 + A_1(h)\,q_2 + A_1 A_0(h)\,q_1 q_2 + \tfrac{1}{2} A_0^2(h)\,q_1^2 + A_0(h)\,q_1.$$
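
The state equations implied by these vector fields can be simulated directly. The Python sketch below assumes the vector fields $A_0 = \partial/\partial q_1$ and $A_1 = \partial/\partial q_2 + q_1\, \partial/\partial q_3$, so that $\dot{q} = A_0(q) + u\,A_1(q)$ gives $\dot{q}_1 = 1$, $\dot{q}_2 = u$ and $\dot{q}_3 = q_1 u$, with zero initial conditions. The forward Euler scheme, step count and function name are hypothetical choices for illustration.

```python
def simulate_realisation(u, t_end, steps=10000):
    """Integrate the three-state realisation with forward Euler.

    State equations (assumed from the vector fields A0 and A1):
        dq1/dt = 1,  dq2/dt = u(t),  dq3/dt = q1 * u(t),
    started from zero initial conditions.
    """
    dt = t_end / steps
    q1 = q2 = q3 = 0.0
    for n in range(steps):
        u_n = u(n * dt)
        # The right-hand side is evaluated with the old state before assignment.
        q1, q2, q3 = q1 + dt, q2 + dt * u_n, q3 + dt * q1 * u_n
    return q1, q2, q3


# With a constant unit input the exact states at time t are (t, t, t**2 / 2).
print(simulate_realisation(lambda tau: 1.0, 1.0))
```

The dynamics contain no network weights at all; in this sketch the weights would enter only through the Lie-derivative coefficients of the output function, which matches the observation in the Discussion that different networks share the same minimal state-space dynamics.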

4  Discussion 

This  article  is  concerned  with  local  approximations  to  unknown  nonlinear  dynamics 
up  to  a  specified  order.  The  approximation  order  is  in  terms  of  the  length  of  the 
Generating  series  that  represents  the  input-output  behaviour  of  a  trained  network. 
A  trained  recurrent  network  can  closely  approximate,  in  a  local  sense,  an  unknown 
dynamic  when  the  network  Generating  series  is  similar  to  that  of  the  unknown 
system.  It  should  be  noted  that  the  unknown  dynamic  system  may  not  necessarily 
be  represented  by  a  suitable  global  model  and  therefore  one  should  seek  local  models 
in the first instance. The Generating series, like the Taylor series, is a local functional
expansion. 

In  this  article  only  an  upper  and  lower  bound  on  the  rank  of  the  Lie-Hankel  matrix 
is  described.  The  upper  bound  is  determined  by  the  length  of  the  truncated  Gen¬ 
erating  series  and  is  directly  related  to  the  order  of  approximation  of  the  unknown 
dynamics  by  the  Hopfield  network. 

The  rank  determines  the  dimension  of  the  minimal  realisation.  The  dimension  of 
the  minimal  realisation  is  independent  of  the  number  of  hidden  nodes  in  the  Hopfield 
network.  The  network  trajectory  evolves  locally  on  a  submanifold  of  the  Hopfield 
state-space. 

The  local  minimal  realisation  of  a  truncated  Generating  series  of  a  Hopfield  network 
is  a  set  of  polynomial  differential  equations  and  an  output  which  is  polynomial 
in  the  states.  The  state-space  dynamics  are  fixed  and  do  not  depend  upon  the 
position  in  the  original  Hopfield  state  space.  The  minimal  state  space  dynamics 
are  input  driven,  with  zero  initial  conditions.  The  state  space  dynamics  reflect  the 
local  influence  of  the  system  input.  However,  the  output  map  is  dependent  upon 
the  position  in  the  Hopfield  state  space.  These  observations  on  the  form  of  the  local 
minimal  realisations  imply  that  any  two  networks  which  are  used  to  approximate 
an  unknown  system  have  the  same  minimal  state-space  dynamics  and  differ  only 
in the form of the output function.

We show that in supervised learning from a particular data set Bayesian model selection, based
on the evidence, does not optimise generalization performance even for a learnable linear problem.
This is achieved by examining the finite size effects in hyperparameter assignment from the
evidence procedure and its effect on generalisation. Using simulations we corroborate our analytic
results and examine an alternative model selection criterion, namely cross-validation. This
numerical study shows that in the learnable linear case for finite sized systems leave-one-out
cross-validation estimates correlate more strongly with optimal performance than do those of the
evidence.

1  Introduction 

The problem of supervised learning, or learning from examples, has been much
studied using the techniques of statistical physics (see e.g. [7]). A major advantage
of such studies over the usual approach in the statistics community is that one can
examine the situation where the ratio ($\alpha$) of the number of examples ($p$) to the
number of free parameters ($N$) is finite. This contrasts with the asymptotic (in $\alpha$)
treatments found in the statistics literature (see e.g. [6]). However, one drawback of
the statistical physics approach is that it is based on the so-called thermodynamic
limit, where one allows $N$ and $p$ to approach infinity whilst keeping $\alpha$ constant. A
quantity is said to be self averaging if its variance over data sets of examples tends to
zero in the thermodynamic limit. We show that in Bayesian model selection based on
the evidence, conclusions drawn from the thermodynamic results are qualitatively
at odds with the finite size behaviour.

In the supervised learning scenario one is presented with a set of data

$$\mathcal{D} = \{(y_t(\mathbf{x}^\mu), \mathbf{x}^\mu) : \mu = 1..p\}$$

consisting of $p$ examples of an otherwise unknown teacher mapping denoted by
the distribution $P(y_t \mid \mathbf{x})$. Furthermore, we assume that the $N$ dimensional input
space is sampled with probability $P(\mathbf{x})$. The learning task is to use this data $\mathcal{D}$
to set the $N_s$ parameters $\mathbf{w}$ of some model (or student) such that its output,
$y_s(\mathbf{x})$, generalizes to examples not contained in the training data, $\mathcal{D}$. Often this is
achieved by minimising a weighted sum, $\beta E_w(\mathcal{D}) + \gamma C(\mathbf{w})$, of the quadratic error of
the student on the training examples, $E_w(\mathcal{D})$, and some cost function, $C(\mathbf{w})$, which
penalises over-complex models. Provided $\gamma$ is non-zero this serves to alleviate the
problem of overfitting. It is the setting of the so-called hyperparameters $\beta$ and $\gamma$
which we will examine in this presentation.

In terms of practical methods for hyperparameter assignment there are essentially
two choices. Firstly, one can attempt to estimate the generalisation error (e.g. by
cross-validation [6]) and then optimise this measure with respect to the hyperparameters.
However, such an approach can be computationally expensive. Secondly,
one can optimise some other measure and hope that the resulting assignments produce
low generalisation error. In particular, MacKay [6] advocates the evidence as
such a measure. Model selection based on the evidence, in the case of a linear student
and teacher, has been studied by Bruce and Saad [1] in the thermodynamic
limit. Their results show that optimising the average, over all possible data sets
$\mathcal{D}$, of the log evidence with respect to the hyperparameters optimises the average
generalization error. An average case analysis of an unlearnable scenario can be
found in [3] and shows that in general the evidence need not be optimal on average.
In this paper we examine hyperparameter assignment from the evidence based on
an individual data set, in the learnable linear case. In the next section we review
the evidence framework and introduce the generalization error. In section 3, we
show that the evidence procedure is unbiased and that the evidence and generalization
error are self averaging. In section 4 we examine hyperparameter assignment
from the evidence based on a particular data set. First order corrections to the
performance measures show that in general the evidence procedure does not lead
to optimal performance. Finally, we corroborate these conclusions using a numerical
study which, furthermore, reveals that even leave-one-out cross-validation is a
superior model selection criterion to the evidence in the learnable linear case for
small systems.
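
The first of the two choices, estimating the generalisation error by leave-one-out cross-validation and optimising it over a hyperparameter, can be sketched in Python for a one-parameter linear student. The sketch assumes a scalar input, a quadratic weight penalty with strength `lam`, and a small hypothetical grid of candidate values; none of the names or data come from the paper.

```python
def ridge_weight(xs, ys, lam):
    """Weight minimising sum (y - w x)^2 + lam * w^2 for scalar inputs."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def loo_cv_error(xs, ys, lam):
    """Leave-one-out estimate of the squared prediction error for a given lam."""
    total = 0.0
    for k in range(len(xs)):
        # Train on every example except the k-th, then test on the k-th.
        w = ridge_weight(xs[:k] + xs[k + 1:], ys[:k] + ys[k + 1:], lam)
        total += (ys[k] - w * xs[k]) ** 2
    return total / len(xs)


def select_lambda(xs, ys, candidates):
    """Pick the candidate penalty strength with the smallest leave-one-out error."""
    return min(candidates, key=lambda lam: loo_cv_error(xs, ys, lam))


# Hypothetical noise-free data from a linear teacher with weight 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
print(select_lambda(xs, ys, [0.0, 0.1, 1.0]))
```

Each candidate value requires one fit per held-out example, which illustrates why this route can be computationally expensive for larger models.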

2  Objective  Functions 

2.1  The  Evidence 

Since $E_w(\mathcal{D})$ is the sum squared error then, if we assume that our data is corrupted
by Gaussian noise with variance $1/2\beta$, the probability, or likelihood, of the data ($\mathcal{D}$)
being produced given the model $\mathbf{w}$ and $\beta$ is $P(\mathcal{D} \mid \beta, \mathbf{w}) \propto e^{-\beta E_w(\mathcal{D})}$. The complexity
cost can also be incorporated into this Bayesian scheme by assuming the a priori
probability of a rule is weighted against 'complex' rules, $P(\mathbf{w} \mid \gamma) \propto e^{-\gamma C(\mathbf{w})}$. Multiplying
the likelihood and the prior together we obtain the post training or student
distribution, $P(\mathbf{w} \mid \mathcal{D}, \gamma, \beta) \propto e^{-\beta E_w(\mathcal{D}) - \gamma C(\mathbf{w})}$. It follows that the most probable
model $\mathbf{w}^*$ is given by minimizing the composite cost function $\beta E_w(\mathcal{D}) + \gamma C(\mathbf{w})$
with respect to $\mathbf{w}$.

The evidence, $P(\mathcal{D} \mid \gamma, \beta)$, is the normalisation constant for the post training
distribution. That is, the probability of (or evidence for) the data set ($\mathcal{D}$) given the
hyperparameters $\beta$ and $\gamma$. Throughout this paper we refer to the evidence procedure
as the process of fixing the hyperparameters to the values that simultaneously
maximize the evidence for a given data set.

2.2  The  Generalization  Error 

We will use the notation $\langle f(z) \rangle_P$ to denote the average of the quantity $f(z)$ over
the distribution $P(z)$. However, we will use the shorthand $\langle \cdot \rangle_w$ to mean the average
over the post training distribution $P(\mathbf{w} \mid \mathcal{D}, \gamma, \beta)$. As our performance measure we
choose the expected squared difference over the input distribution $P(\mathbf{x})$ between the average
student and the average teacher. That is, the data dependent generalisation error,
$\epsilon_g(\mathcal{D}) = \langle ( \langle y_t(\mathbf{x}) \rangle_{P(y_t \mid \mathbf{x})} - \langle y_s(\mathbf{x}) \rangle_w )^2 \rangle_{P(\mathbf{x})}$. If we were to average over all possible
data sets of fixed size then this would correspond to the generalization error studied
3 Finite System Size

Since the student is linear with output $y(\mathbf{x}) = \mathbf{w}\cdot\mathbf{x}/\sqrt{N}$, Ng = N. We also assume
that the teacher mapping is linear, with weights $\mathbf{w}^0$, and corrupted by zero mean
Gaussian noise of variance $\sigma^2$. Thus,
$P(y_t \mid \mathbf{x}) \propto \exp(-(y_t - \mathbf{w}^0\cdot\mathbf{x}/\sqrt{N})^2/2\sigma^2)$. Further,
we assume $P(\mathbf{x})$ is $\mathcal{N}(0, \sigma_x)$ ^1 and adopt weight decay as our regularization
procedure, that is $C(\mathbf{w}) = \mathbf{w}^T\mathbf{w}$. In this case we can explicitly calculate the evidence,
or rather the normalised log of the evidence $f(\mathcal{D}) = -\frac{1}{N} \ln P(\mathcal{D} \mid \lambda, \beta)$, where
we have introduced the weight decay parameter $\lambda = \gamma/\beta$. The generalisation error
and the consistency can be calculated from $f(\mathcal{D})$ by averaging appropriate expressions
over the input distribution $P(\mathbf{x})$. Details of these calculations will appear in
a subsequent paper [4].
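As a rough illustration of how the normalised log evidence can be evaluated for a single data set, the sketch below integrates out the weights analytically: with noise variance $1/2\beta$ and a weight-decay prior of variance $1/2\gamma$ per weight, the outputs are jointly Gaussian. The function names, the grid search and the toy data are assumptions made for this example, not the authors' calculation (which is deferred to [4]).

```python
import numpy as np

def normalised_log_evidence(X, y, lam, beta):
    """Return f(D) = -(1/N) ln P(D | lam, beta) for the linear student with weight decay."""
    p, N = X.shape
    gamma = lam * beta                       # lambda = gamma / beta
    Xs = X / np.sqrt(N)                      # student output is w.x / sqrt(N)
    # Marginal of y: zero-mean Gaussian; covariance = noise term + prior term.
    cov = np.eye(p) / (2.0 * beta) + Xs @ Xs.T / (2.0 * gamma)
    _, logdet = np.linalg.slogdet(cov)
    quad = y @ np.linalg.solve(cov, y)
    log_ev = -0.5 * (quad + logdet + p * np.log(2.0 * np.pi))
    return -log_ev / N

def evidence_procedure(X, y, lams, betas):
    """Grid search for the (lam, beta) pair that minimises f(D), i.e. maximises the evidence."""
    f = np.array([[normalised_log_evidence(X, y, l, b) for b in betas] for l in lams])
    i, j = np.unravel_index(np.argmin(f), f.shape)
    return lams[i], betas[j], f[i, j]

# Hypothetical toy data: linear teacher plus Gaussian output noise.
rng = np.random.default_rng(0)
N, p, sigma2 = 5, 40, 0.25
w0 = rng.normal(size=N)
X = rng.normal(size=(p, N))
y = X @ w0 / np.sqrt(N) + rng.normal(scale=np.sqrt(sigma2), size=p)

lams = np.logspace(-2, 1, 31)
betas = np.logspace(-1, 1.5, 31)
lam_ev, beta_ev, f_min = evidence_procedure(X, y, lams, betas)
print(lam_ev, beta_ev, f_min)
```

A grid search is used only for transparency; any smooth optimiser over $(\lambda, \beta)$ would serve the same purpose.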

3.1  Consistency,  Unbiasedness  and  Self  Averaging 

Firstly, we examine the free energy, $f(\mathcal{D})$, and the generalisation error in the limit
of large amounts of data (i.e. as $p \to \infty$ with $N$ fixed). Using the central limit
theorem we can show that, in this limit, to first order the generalisation error is
independent of the weight decay whilst $f$ is optimised by $\lambda_{ev} = \lambda_0 \equiv \ldots$ and
$\beta_{ev} = \beta_0 = \ldots$. As we shall see later in the context of large $N$, this insensitivity
of the generalisation error to the value of the weight decay is associated with a
divergence in the variance of the optimal weight decay as the number of examples
grows large. This asymptotic insensitivity to the weight decay is a reflection of the
fact that our linear student is mean square consistent. We will thus focus on the
following quantity when assessing the evidence procedure's performance.


(1) 


Secondly, it can be shown that if … $\propto \delta_{ij}$, then the resulting average
free energy, $f = \langle f(\mathcal{D}) \rangle_{P(\mathcal{D})}$, is extremised by $\lambda = \lambda_0$ and $\beta = \beta_0$. Similarly, the
average generalisation error is optimised by $\lambda_0$. This corresponds to the average
case result obtained for the thermodynamic limit in [1] but is valid for all $N$ and
$p$. Thus, the particular conclusion of the thermodynamic average case analysis
for the learnable linear scenario, that the evidence procedure optimises average
performance, is valid for all $N$, and in this sense the procedure is unbiased.

Finally, using results of [9] ^2 one can show that the variance, over possible realisations
of the data set, of the free energy $f(\mathcal{D})$ is $O(1/N)$ as we approach
the thermodynamic limit; it is a self averaging quantity. Similarly, it can also be
shown that the generalization error is self averaging. This means that in the
thermodynamic limit the behaviour exhibited by the system for any particular data set
will correspond to the average case behaviour, that is, the fluctuations around the
average vanish. Thus, the average case analysis of [1] corresponds to the case for a
particular data set because their results were obtained in the thermodynamic limit.

4  Data  Dependent  Hyperparameter  Assignment 

Having  now  established  that  the  evidence  procedure  is  unbiased  and  that  the  free 
energy  and  performance  measures  are  self  averaging  we  now  wish  to  examine  the 
system  behaviour  for  particular  data  sets  of  finite  size.  This  is  clearly  the  regime 

^1 Where $\mathcal{N}(x, \sigma)$ denotes a normal distribution with mean $x$ and variance $\sigma$.

^2 Alternatively one can show this result using diagrammatic methods.
Figure 1 The variance in the optimal weight decay ($\mathrm{Var}(\lambda_{opt})$) for various noise
levels, (i) $\lambda_0 = 0.04$, (ii) $\lambda_0 = 0.25$ and (iii) $\lambda_0 = 4/9$, is shown in the left-hand
graph. Notice the linear divergence in $\alpha$ which corresponds to our result in section
3.1 that, for sufficiently large $p$, the generalization error is independent of $\lambda$. The
variance in the evidence optimal weight decay ($\mathrm{Var}(\lambda_{ev})$) is shown, in the right-hand
graph, for the same noise levels. The $O(1/\alpha)$ decay of this quantity is a
reflection of the fact that for large $p$, $\lambda_{ev}(\mathcal{D}) = \lambda_0$.


of interest to real world applications since one is then in the business of optimising
performance based on a particular data set. To obtain the hyperparameter assignments
made by the evidence procedure we must simultaneously solve $\partial_\lambda f(\mathcal{D}) = 0$
and $\partial_\beta f(\mathcal{D}) = 0$, where $\partial_\theta f = \partial f/\partial\theta$. We can linearize these equations, close to the
thermodynamic limit, by expanding around $\lambda = \lambda_0$ and $\beta = \beta_0$. Similarly, we can
also expand the true optimal weight decay about the thermodynamic limit value,
$\lambda_0$.

We find that (co)-variances of these quantities are $O(1/N)$. Figure 1 shows, to first
order, the scaled variances ^3 in the evidence optimal weight decay, $\mathrm{Var}(\lambda_{ev})$, and
that in the true optimal weight decay, $\mathrm{Var}(\lambda_{opt})$. The asymptotic $O(1/\alpha)$ decay
of the former reflects the fact that, as discussed in section 3.1,
$\lim_{\alpha \to \infty} \lambda_{ev}(\mathcal{D}) = \lambda_0$. Similarly, the divergence of the latter is indicative of the insensitivity of the
generalization error to the weight decay for large $\alpha$. The divergence of both curves
for small $\alpha$ is $O(1/N\alpha)$ and reflects the breakdown of the thermodynamic
limit when the number of examples $p$ does not scale with the system size $N$.
Similarly we find that the average squared distance between the evidence assignment
and the optimal, $\langle (\lambda_{opt}(\mathcal{D}) - \lambda_{ev}(\mathcal{D}))^2 \rangle_{P(\mathcal{D})}$, is $O(1/N)$. This distance
is non zero, except for $\alpha > 1$ in the noiseless limit. Further, in the large $\alpha$ limit
this distance diverges, revealing the inconsistency of the evidence weight decay
assignment.

4.1  Effects  on  Performance 

We now examine the effects on performance of this sub-optimal hyperparameter
assignment. Firstly, to order $O(1/\sqrt{N})$ the optimal performance, $\epsilon_g(\lambda_{opt}, \mathcal{D})$, and
that resulting from use of the evidence procedure, $\epsilon_g(\lambda_{ev}, \mathcal{D})$, are the same. However,
to order $O(1/N)$ they differ; thus we can write the correlation between them, somewhat
suggestively, as $1 - O(1/N)$. Unfortunately, we are unable to calculate this
correlation to $O(1/N)$. Therefore, we examine $\langle \ldots \rangle_{P(\mathcal{D})}$, which tends to $1/N$ in
the limit of large $\alpha$, reflecting the inconsistency of the evidence weight decay
assignments. In the limit of no noise for $\alpha > 1$ we find that $\langle \ldots \rangle_{P(\mathcal{D})} = \ldots$,
revealing that even for small noise levels the evidence procedure is sub-optimal.

^3 i.e. $N$ times the true variances.

4.2  Comparison  with  Cross-validation 

Given that the evidence procedure is sub-optimal it is natural to ask if another
model selection criterion could do better. Here we compare the evidence procedure
with leave-one-out cross-validation using simulations of a 1-dimensional system.
That is, we set the weight decay using the cross-validatory estimate and the evidence
estimate and compare the resulting generalisation error to the optimal. The
results, averaged over 1000 realisations of the data set for each value of $p$, are plotted
in figure 2. These show that the evidence weight decay assignment results in
sub-optimal performance with $\epsilon_g(\lambda_{ev}, \mathcal{D})$ not fully correlated with $\epsilon_g(\lambda_{opt}, \mathcal{D})$. Moreover,
the left-hand graph shows that the resulting error from the cross-validatory
estimate correlates more strongly with the optimal generalisation error than does
that resulting from the evidence estimate. In addition, the right-hand graph shows
that the fractional increase in the generalisation error is considerably larger for the
evidence procedure than for cross-validation.
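The cross-validatory choice of weight decay can be sketched as follows. This is a minimal illustration rather than the simulation behind figure 2: it assumes the weights minimise the sum squared error plus $\lambda \mathbf{w}^T\mathbf{w}$, and it uses the standard hat-matrix identity for linear smoothers so that no refitting is needed. All function names and the toy 1-dimensional data are hypothetical.

```python
import numpy as np

def ridge_weights(X, y, lam):
    """Minimise sum-squared error + lam * w.w for outputs w.x / sqrt(N)."""
    N = X.shape[1]
    Xs = X / np.sqrt(N)
    return np.linalg.solve(Xs.T @ Xs + lam * np.eye(N), Xs.T @ y)

def loo_error(X, y, lam):
    """Leave-one-out squared error via the hat-matrix identity (no refitting)."""
    N = X.shape[1]
    Xs = X / np.sqrt(N)
    H = Xs @ np.linalg.solve(Xs.T @ Xs + lam * np.eye(N), Xs.T)
    resid = y - H @ y
    return np.mean((resid / (1.0 - np.diag(H))) ** 2)

def loo_error_brute_force(X, y, lam):
    """Same quantity, refitting with each example held out in turn."""
    N = X.shape[1]
    errs = []
    for mu in range(len(y)):
        keep = np.arange(len(y)) != mu
        w = ridge_weights(X[keep], y[keep], lam)
        errs.append((y[mu] - X[mu] @ w / np.sqrt(N)) ** 2)
    return np.mean(errs)

# Hypothetical 1-dimensional system with a noisy linear teacher.
rng = np.random.default_rng(1)
N, p = 1, 12
X = rng.normal(size=(p, N))
y = X @ np.array([1.0]) / np.sqrt(N) + rng.normal(size=p)

lams = np.logspace(-3, 2, 51)
lam_cv = lams[np.argmin([loo_error(X, y, l) for l in lams])]
print(lam_cv)
```

The brute-force routine is included only as a cross-check on the shortcut; for larger $p$ the hat-matrix form is the cheaper of the two.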


Figure 2 1-D simulation results. The left-hand graph shows the correlation between
the optimal generalization error and those obtained using the evidence (solid)
and cross-validation (chain) with $\lambda_0 = 1.0$. The right-hand graph shows the fractional
increase in generalization error $\kappa_{\epsilon_g}(\lambda) = (\epsilon_g(\lambda) - \epsilon_g(\lambda_{opt}))/\epsilon_g(\lambda_{opt})$. $\lambda$ is set
by the evidence (dashed) and by cross-validation (chain) for $\lambda_0 = 1.0$. For $\lambda_0 = 0.01$
the evidence case is the solid curve and cross-validation the dotted curve. In the latter
case the error bars are not shown for the sake of clarity.


5  Conclusion 

We have shown that, despite thermodynamic and average case results to the contrary,
model selection based on the evidence does not optimise performance even
in the learnable linear case. In addition, numerical studies indicate that for small
systems cross-validation is closer to optimal performance than the evidence procedure.
However, for large systems the evidence might still be a reasonable alternative
to the computationally expensive cross-validation.
Radial basis function (RBF) networks have the great advantage that they may be determined
rapidly  by  solving  a  linear  least  squares  problem,  assuming  that  good  positions  for  the  radial 
centres  may  be  selected  in  advance.  Here  it  is  shown  that,  if  there  is  some  structure  in  the  data, 
for  example  if  the  data  lie  on  lines,  then  variables  in  Gaussian  RBFs  may  be  separated  and  a 
near-optimal  least  squares  solution  may  be  obtained  rather  efficiently.  Second,  it  is  shown  that 
a  system  of  Gaussian  RBFs  with  structured  or  scattered  centres  may  be  orthogonalized  over  a 
continuum  or  discrete  data  set  and  thereafter  the  least  squares  solution  is  immediate.  Keywords: 
Gaussians,  orthogonalization,  RBF,  separability. 

1  Introduction 

Suppose that there are $m$ input data and that each datum $\mathbf{x}$ has $d$
components $x_1, \ldots, x_d$. Suppose that we adopt a radial basis function network with
the centres $\mathbf{w}^{(1)}, \ldots, \mathbf{w}^{(n)}$ given by $\mathbf{w}^{(i)} = (w_1^{(i)}, \ldots, w_d^{(i)})^T$, for $i = 1, \ldots, n$.
The components, $w_j^{(i)}$, of the centres are "weights" in the network, between the input
and hidden layer. Frequently good choices of centres may be made by clustering
techniques [3], and so we assume $w_j^{(i)}$ to be fixed. Suppose that coefficients $c_i$, for
$i = 1, \ldots, n$, are associated with radial basis (transfer) functions $\phi_i$ applied to
the argument $r_i$, where

$r_i = \|\mathbf{x} - \mathbf{w}^{(i)}\|$.

The coefficients $\{c_i\}$ are weights between the hidden and output layers, and the
output function $f(\mathbf{x})$ is approximated by an RBF, $F(\mathbf{x})$, as follows,

$f(\mathbf{x}) \approx F(\mathbf{x}) = \sum_{i=1}^{n} c_i \phi_i(\|\mathbf{x} - \mathbf{w}^{(i)}\|)$.

Frequently the RBFs $\{\phi_i(r)\}$ are taken to be the same function $\phi(r)$ for each $i$. In
this paper we consider the basis function $\phi(r) = \exp(-r^2)$.

2  Separability  of  Gaussians 

The  Gaussian  has  some  particular  advantages  in  terms  of  (i)  its  separability  and 
(ii)  the  simplicity  of  its  inner  product.  In  this  section,  we  consider  separability  and 
in  Section  3  we  exploit  the  inner  product  in  discussing  orthogonalization. 

If $\phi(r) = \exp(-r^2)$, then

$\phi(\|\mathbf{x}\|) = \phi(x_1) \phi(x_2) \cdots \phi(x_d)$.

Thus the RBF is a product of $d$ one-dimensional Gaussians.
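The separability identity is easy to confirm numerically. The snippet below is only a sanity check on a randomly chosen point; the variable names are illustrative.

```python
import numpy as np

def phi(r):
    """Gaussian basis function phi(r) = exp(-r^2)."""
    return np.exp(-r ** 2)

rng = np.random.default_rng(0)
x = rng.normal(size=5)                 # an arbitrary point with d = 5 components

radial = phi(np.linalg.norm(x))        # phi(||x||)
product = np.prod(phi(x))              # phi(x_1) * phi(x_2) * ... * phi(x_d)
print(radial, product)
```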
2.1 Fast Approximation on a Mesh

Suppose that the 2-dimensional data $\mathbf{x} = (x_1, x_2)^T$ and the corresponding centres
$\mathbf{w} = (w_1, w_2)^T$ are each placed on meshes

$\mathbf{x}_{k\ell} = (x_1^{(k)}, x_2^{(\ell)})^T$ for $k = 1, \ldots, m_x$ and $\ell = 1, \ldots, m_y$,

$\mathbf{w}_{ij} = (w_1^{(i)}, w_2^{(j)})^T$ for $i = 1, \ldots, n_x$ and $j = 1, \ldots, n_y$,

where $n_x < m_x$ and $n_y < m_y$.

Here $n_x$ and $n_y$ are the numbers of different values for the two respective components
of the centres, and $m_x$ and $m_y$ are the corresponding numbers of different
values for the two components of the data. Then our RBF approximation on the
data is

$f(\mathbf{x}_{k\ell}) \approx \sum_{i=1}^{n_x} \sum_{j=1}^{n_y} c_{ij} \, \phi(x_1^{(k)} - w_1^{(i)}) \, \phi(x_2^{(\ell)} - w_2^{(j)})$. (1)

For each $\ell = 1, \ldots, m_y$, we obtain the (overdetermined) linear system,

$f(\mathbf{x}_{k\ell}) = \sum_{i=1}^{n_x} b_i^{(\ell)} \phi(x_1^{(k)} - w_1^{(i)})$ (2)

for $k = 1, \ldots, m_x$, where, for each $i = 1, \ldots, n_x$, we have the (overdetermined)
system

$b_i^{(\ell)} = \sum_{j=1}^{n_y} c_{ij} \phi(x_2^{(\ell)} - w_2^{(j)})$ (3)

for $\ell = 1, \ldots, m_y$.

Using matrix notation, we may express the system of equations (2) as

$A_x \mathbf{b}^{(\ell)} = \mathbf{f}^{(\ell)}$

for $\ell = 1, 2, \ldots, m_y$, where $A_x$ is a matrix whose $(k, i)$th element is $\phi(x_1^{(k)} - w_1^{(i)})$,
$\mathbf{b}^{(\ell)}$ is a vector of the elements $b_i^{(\ell)}$ for $i = 1, 2, \ldots, n_x$ and $\mathbf{f}^{(\ell)}$ is a vector of the
elements $f(\mathbf{x}_{k\ell})$ for $k = 1, 2, \ldots, m_x$. We note that the matrix $A_x$ is independent
of $\ell$ and so it is only necessary to factorize it once.

Similarly, we may express the system (3) as

$A_y \mathbf{c}^{(i)} = \mathbf{b}_i$

for $i = 1, 2, \ldots, n_x$. Here, the matrix $A_y$ is independent of $i$ and so, again, it is
only necessary to factorize this matrix once. Thus, the solution of (2) and (3) by
QR factorization can be achieved in $O(m_y n_x (m_x + n_y))$ operations. This is a great
saving over the $O(m_x m_y n_x^2 n_y^2)$ operations that would be required if the system (1)
of $m_x m_y$ equations were solved by least squares without exploiting any structure.
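The two-stage solve can be sketched in a few lines. This is an illustrative example under stated assumptions, not the authors' implementation: it uses NumPy's SVD-based `lstsq` in place of an explicit QR factorization, the mesh and centre positions are hypothetical, and the "data" are random numbers. It then compares the result with the unstructured least squares problem built from a Kronecker product.

```python
import numpy as np

def gaussian_matrix(points, centres):
    """Matrix with (k, i) element phi(points[k] - centres[i]), phi(r) = exp(-r^2)."""
    return np.exp(-(points[:, None] - centres[None, :]) ** 2)

def mesh_fit(x1, x2, w1, w2, F):
    """Two-stage least squares fit; F[k, l] holds f at mesh point (x1[k], x2[l])."""
    Ax = gaussian_matrix(x1, w1)                    # m_x by n_x
    Ay = gaussian_matrix(x2, w2)                    # m_y by n_y
    B = np.linalg.lstsq(Ax, F, rcond=None)[0]       # system (2) for every l: n_x by m_y
    C = np.linalg.lstsq(Ay, B.T, rcond=None)[0].T   # system (3) for every i: n_x by n_y
    return C

rng = np.random.default_rng(0)
x1, x2 = np.linspace(-3, 3, 12), np.linspace(-3, 3, 10)   # data mesh (hypothetical)
w1, w2 = np.linspace(-2, 2, 4), np.linspace(-2, 2, 3)     # centre mesh (hypothetical)
F = rng.normal(size=(x1.size, x2.size))

C = mesh_fit(x1, x2, w1, w2, F)

# Unstructured reference: one problem with m_x*m_y rows and n_x*n_y columns.
A_full = np.kron(gaussian_matrix(x1, w1), gaussian_matrix(x2, w2))
C_full = np.linalg.lstsq(A_full, F.ravel(), rcond=None)[0].reshape(w1.size, w2.size)
print(np.max(np.abs(C - C_full)))
```

Each one-dimensional matrix is handled once and reused for all right-hand sides, which is the source of the saving described above.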

2.2  Data  on  Lines 

If the data are less structured than a mesh, but do form lines, then fast methods
may be adopted. Consider lines of data, where the $x_2$ values are fixed but the $x_1$
values are scattered (in different positions on each line). The data abscissae may
be written as $(x_1^{(k,\ell)}, x_2^{(\ell)})$ for $k = 1, \ldots, m_\ell$ and $\ell = 1, \ldots, m_y$, so the $x_1$ values
now vary with both $k$ and $\ell$. Equations (2) and (3) are still valid, but with $x_1^{(k)}$ and
$m_x$ replaced by $x_1^{(k,\ell)}$ and $m_\ell$. We may therefore still solve (2) and (3) in the least
squares sense. However, the solution is no longer the true least squares solution
of (1). Indeed the solution of (2) now involves a different matrix for each value of $\ell$,
and this also means that the solution is less efficient: $O(m_x m_y n_x^2)$ operations are
needed to solve (2), although only $O(m_y n_y n_x)$ operations are still required for (3).
The method we describe here is the Gaussian RBF analogue of the methods of
Clenshaw and Hayes [2] for polynomial approximation and of Anderson, Cox and
Mason [1] for spline approximation of data on lines.

3  Orthogonalized  Gaussians 

Gaussians can be orthogonalized in a number of ways. Suppose that for data $\{\mathbf{x}\} =
\{(x_1, \ldots, x_d)^T\}$ we have centres $\mathbf{w}^{(i)}$ for $i = 1, \ldots, n$, and Gaussian RBFs
$\phi_i(\mathbf{x}) = \exp(-\|\mathbf{x} - \mathbf{w}^{(i)}\|^2)$ for $i = 1, \ldots, n$.

Then a general orthogonalization technique, based on the Gram-Schmidt procedure,
is to form a new basis $\psi_1, \ldots, \psi_n$ as,

i>i  =  (4) 

k-l 

fc=l 

for $i = 1, 2, \ldots, n$, where … and … are appropriate coefficients such that … $= 1$
and … $= 1$. The values of the remaining coefficients are determined by requiring
the new basis functions, $\psi_1, \ldots, \psi_n$, to be orthogonal with respect to some specified
inner product. Thus we have

$\langle \psi_i, \psi_j \rangle = 0$ for $i \neq j$.

Let us define

$t_{i,j} = \langle \phi_i, \phi_j \rangle$ (6)

and

$n_k = \langle \psi_k, \psi_k \rangle$,

the values of $t_{i,j}$ being constants that may be calculated once and for all, and the
values of $n_k$ being normalization constants. Then, by setting $\langle \psi_i, \psi_j \rangle = 0$ where
(without loss of generality) $i < j$ we deduce, using equations (4) and (5), that


Xj)  _ 


-1 


kzzl 


(7) 


By  taking  inner  products  of  both  sides  of  equation  (5)  with  themselves,  we  deduce 
that 

i-l 

ni=ti,i,  and  (8) 

*=1 


Expanding  equation  (5)  using  (4)  and  equating  similar  terms,  we  derive 

=  (9) 

k=i 

Equations  (7)-(9)  suffice  to  determine  Uk,  and  for  all  z  <  j  given  values  of 
3.1 Continuum

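Let us adopt an inner product over $\mathbb{R}^d = (-\infty, \infty)^d$ and define

$V = \int \exp(-\|\mathbf{x}\|^2) \, dS = \int \exp(-\|\mathbf{x} - \mathbf{w}\|^2) \, dS$,

for any fixed $\mathbf{w}$. Now $t_{i,j}$ is readily calculated by using the following formula (whose
derivation follows from the cosine rule),

$\|\mathbf{x} - \mathbf{w}^{(i)}\|^2 + \|\mathbf{x} - \mathbf{w}^{(j)}\|^2 = 2\|\mathbf{x} - \mathbf{w}_{ij}\|^2 + \frac{1}{2}\|\mathbf{w}^{(i)} - \mathbf{w}^{(j)}\|^2$, (10)

where $\mathbf{w}_{ij} = \frac{1}{2}(\mathbf{w}^{(i)} + \mathbf{w}^{(j)})$.

Now equation (6) becomes,

$t_{i,j} = \langle \phi_i, \phi_j \rangle$

$= \int \exp(-\|\mathbf{x} - \mathbf{w}^{(i)}\|^2 - \|\mathbf{x} - \mathbf{w}^{(j)}\|^2) \, dS$

$= \exp(-\frac{1}{2}\|\mathbf{w}^{(i)} - \mathbf{w}^{(j)}\|^2) \int \exp(-2\|\mathbf{x} - \mathbf{w}_{ij}\|^2) \, dS$

$= (\sqrt{2})^{-d} \, V \exp(-\frac{1}{2}\|\mathbf{w}^{(i)} - \mathbf{w}^{(j)}\|^2)$.

Thus $t_{i,j}$ is immediately determined from $\{\mathbf{w}^{(i)}\}$ without sums or integrals ($V$ being
a scalar multiplier), and the orthogonalization procedure is particularly
straightforward.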
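To make the closed form concrete, the sketch below builds the matrix of $t_{i,j}$ directly from the centres, using the standard value $V = \pi^{d/2}$, and then orthogonalizes. As an assumption for brevity it obtains the Gram-Schmidt coefficients from a Cholesky factorization of the Gram matrix rather than from the paper's recurrences (7)-(9); in exact arithmetic the two give the same orthogonal basis with unit leading coefficients. The centres and function names are hypothetical.

```python
import numpy as np

def continuum_gram(W):
    """t_ij = (sqrt 2)^(-d) * V * exp(-||w_i - w_j||^2 / 2), with V = pi^(d/2)."""
    d = W.shape[1]
    V = np.pi ** (d / 2.0)
    sq = np.sum((W[:, None, :] - W[None, :, :]) ** 2, axis=-1)
    return 2.0 ** (-d / 2.0) * V * np.exp(-0.5 * sq)

def orthogonalise(T):
    """Return (A, n) with psi = A @ phi mutually orthogonal and <psi_k, psi_k> = n[k]."""
    L = np.linalg.cholesky(T)          # T = L L^T
    dvals = np.diag(L)
    B = L / dvals                      # unit lower triangular, phi = B @ psi
    A = np.linalg.inv(B)               # unit lower triangular, psi = A @ phi
    return A, dvals ** 2

W = np.array([[0.0], [1.0], [2.5]])    # three hypothetical centres, d = 1
T = continuum_gram(W)
A, n = orthogonalise(T)

# Cross-check one entry of T by direct quadrature on a fine grid.
x = np.linspace(-12.0, 14.0, 200001)
t01_numeric = np.sum(np.exp(-(x - W[0, 0]) ** 2) * np.exp(-(x - W[1, 0]) ** 2)) * (x[1] - x[0])
print(T[0, 1], t01_numeric)
```

Row $i$ of `A` holds the coefficients expressing $\psi_i$ in terms of $\phi_1, \ldots, \phi_i$.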

This inner product is the natural one to use for a continuum of data. Indeed the
best least squares approximation to $f$ of the form

$F = \sum_{i=1}^{n} c_i \psi_i$

is defined by setting

$c_i = \langle f, \psi_i \rangle / n_i$.

3.2  Scattered  Discrete  Data 

We might also adopt this inner product for a discrete data set, since it has the
advantage of being data independent. The snag is that we do not then obtain a
diagonal system for determining a least squares approximation on the data set and
must solve

$\sum_{j=1}^{n} c_j [\psi_i, \psi_j] = [f, \psi_i]$

for $i = 1, \ldots, n$, where $[F, \phi]$ denotes the inner product over the discrete data set.
However, we would expect the matrix with entries $[\psi_i, \psi_j]$ to be "close to diagonal".
Alternatively, if we define a discrete inner product over the data $\{\mathbf{x}^{(k)}\}$

$\langle f, g \rangle = \sum_k f(\mathbf{x}^{(k)}) \, g(\mathbf{x}^{(k)})$

and write

$S_{k,i} = \exp(-\|\mathbf{x}^{(k)} - \mathbf{w}^{(i)}\|^2) = \phi_i(\mathbf{x}^{(k)})$

then

$t_{i,j} = \langle \phi_i, \phi_j \rangle = (S^T S)_{ij}$

where $S$ is the matrix of $S_{k,i}$. Thus we are effectively forming the components of
the normal matrix. The calculation is similar to that of Section 3.1, except that
$S^T S$ is now used rather than $S$ (hence requiring a more complicated calculation).
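For scattered data the same idea can be sketched with the discrete inner product. The example below forms $S$ and $T = S^T S$, orthogonalizes through a Cholesky factor (an assumed stand-in for the Gram-Schmidt recurrences), and reads off the least squares coefficients as $\langle f, \psi_i \rangle / n_i$ without solving a linear system. The data, centres and target values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(50, 2))                  # scattered data x^(k)
W = np.array([[-1.5, -1.0], [0.0, -1.0], [1.5, -1.0],
              [-1.5, 1.0], [0.0, 1.0], [1.5, 1.0]])       # centres w^(i)
f = np.sin(X[:, 0]) * np.cos(X[:, 1])                     # sample target values

# S[k, i] = phi_i(x^(k)) and t_ij = <phi_i, phi_j> = (S^T S)_ij.
S = np.exp(-np.sum((X[:, None, :] - W[None, :, :]) ** 2, axis=-1))
T = S.T @ S

L = np.linalg.cholesky(T)
B = L / np.diag(L)                     # unit lower triangular, phi = B @ psi
n = np.diag(L) ** 2                    # n_k = <psi_k, psi_k>
Psi = np.linalg.solve(B, S.T).T        # Psi[k, i] = psi_i(x^(k))

c = Psi.T @ f / n                      # coefficients are immediate: <f, psi_i> / n_i
fit_orth = Psi @ c
fit_lstsq = S @ np.linalg.lstsq(S, f, rcond=None)[0]
print(np.max(np.abs(fit_orth - fit_lstsq)))
```

Both routes give the same fitted values on the data; the orthogonalized route simply moves the work into forming the basis.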

4  Results 

The validity of the procedure of Section 3.2 has been successfully tested by using
orthogonalized RBFs to recognize the ten digits 0, ..., 9 from their pixel patterns.
We do not have sufficient space here to give details, but note that the procedure is
very fast compared with using back propagation procedures and sigmoidal
approximations. We have also calculated condition numbers for a variety of RBF matrices
which occur in data fitting, and there are apparent advantages for conditioning in
using orthogonality on a continuum rather than on a discrete data set. However,
further work is needed to develop an orthogonalization algorithm which consistently
improves conditioning compared with the use of a conventional Gaussian basis.
We consider the problem of estimating a density function from a sequence of $N$ independent and
identically distributed observations $x_i$ taking values in $\mathbb{R}^d$. The estimation procedure constructs
a convex mixture of 'basis' densities and estimates the parameters using the maximum likelihood
method. Viewing the error as a combination of two terms, the approximation error measuring the
adequacy of the architecture, and the estimation error resulting from the finiteness of the sample
size, we derive upper bounds to the total error, thus obtaining bounds for the rate of convergence.
These results then allow us to derive explicit expressions relating the sample complexity and model
complexity needed for learning.

1  Introduction 

There  have  traditionally  been  two  principal  approaches  to  density  estimation,  namely 
the  parametric  approach  which  makes  stringent  assumptions  about  the  density,  and 
the  nonparametric  approach  which  is  essentially  distribution  free.  In  recent  years, 
a  new  approach  to  density  estimation,  often  referred  to  as  the  method  of  sieves  [2] 
has  emerged.  In  this  latter  approach,  one  considers  a  family  of  parametric  models, 
where  each  member  of  the  family  is  assigned  a  ‘complexity’  index  in  addition  to 
the  parameters.  In  the  process  of  estimating  the  density  one  usually  sets  out  with  a 
simple  model  (low  complexity  index)  slowly  increasing  the  complexity  of  the  model 
as  the  need  may  be.  This  general  strategy  seems  to  exploit  the  benefits  of  both  the 
parametric  as  well  as  the  nonparametric  approaches,  namely  fast  convergence  rates 
and  universal  approximation  ability,  while  not  suffering  from  the  drawbacks  of  the 
other  methods.  In  this  paper  we  consider  the  representation  of  densities  as  convex 
combinations  of  ‘basis’  densities,  thus  permitting  an  interpretation  as  a  mixture 
model.  We  split  the  problem  into  two  separate  issues,  namely  approximation  and 
estimation,  which  are  discussed  at  more  length  in  section  2. 

2  Statement  of  the  Problem 

The problem of density approximation by convex combinations of densities can be
phrased as follows: we wish to approximate a class of density functions by a convex
combination of 'basis' densities. In this work we consider the class, $\mathcal{F}$, of compactly
supported and continuous a.e. density functions. We thus seek linear combinations
of densities of the form

$f_n(\mathbf{x}) = \sum_{i=1}^{n} \alpha_i \phi(\mathbf{x}; \theta_i)$  ($\alpha_i \geq 0$ and $\sum_{i=1}^{n} \alpha_i = 1$), (1)

where $\phi(\mathbf{x}; \theta_i)$ represent a family of densities, the members of which are parameterized
by the parameter vectors $\{\theta_i\}$. We denote the full set of parameters by $\theta$,
namely $\theta = \{\{\alpha_i\}, \{\theta_i\}\}$. In particular, we wish to find values $n^*$ and $\theta^*$ such that
for any $\epsilon > 0$, $d(f, f_n^*) \leq \epsilon$, where $f_n^*$ is the value of $f_n$ evaluated at $\theta = \theta^*$ and
$n = n^*$. This objective can be attained by a specific choice of the basis densities $\phi$,
so that the resulting class of mixtures is dense in $\mathcal{F}$ (see [2] for exact conditions). Here $d(f, g)$ represents
some generic distance function between densities $f$ and $g$, whose exact form will
be specified in the next section. We note that in the usual formulation of mixture
estimation, one usually considers a fixed value of $n$, assuming that the true density
is a member of the class. In our formulation $n$ is a parameter, whose magnitude
will be bounded.

Another line of recent work is that of function approximation through linear
combinations of non-linearly parameterized 'basis' functions (for example neural
networks). The novel feature concerning the representation given in eq. (1), as
compared with the function approximation literature, is that we demand the coefficients
$\alpha_i$ to be nonnegative and sum to one, and moreover require the functions $\phi(\mathbf{x}; \theta)$
to be densities, i.e. $\phi(\mathbf{x}; \theta) \geq 0$ and $\int \phi(\mathbf{x}; \theta) \, d\mathbf{x} = 1$.
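The constraints on a convex combination of densities can be illustrated with a small one-dimensional example. The Gaussian basis used here is an assumption chosen for convenience (the paper's class $\mathcal{F}$ concerns compactly supported targets and does not fix a particular basis), and the function names and parameter values are hypothetical.

```python
import numpy as np

def gaussian_density(x, mean, std):
    """One-dimensional Gaussian basis density phi(x; theta) with theta = (mean, std)."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def mixture_density(x, alphas, means, stds):
    """f_n(x) = sum_i alpha_i * phi(x; theta_i), enforcing the convexity constraints."""
    alphas = np.asarray(alphas, dtype=float)
    if np.any(alphas < 0) or not np.isclose(alphas.sum(), 1.0):
        raise ValueError("alphas must be nonnegative and sum to one")
    comps = [a * gaussian_density(x, m, s) for a, m, s in zip(alphas, means, stds)]
    return np.sum(comps, axis=0)

x = np.linspace(-15.0, 15.0, 30001)
fn = mixture_density(x, [0.2, 0.5, 0.3], [-2.0, 0.0, 3.0], [0.5, 1.0, 1.5])
mass = np.sum(fn) * (x[1] - x[0])      # numerical integral of f_n over the grid
print(mass)
```

Because the weights are nonnegative and sum to one, the mixture inherits nonnegativity and unit mass from its components.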

As discussed above, the establishment of the existence of a good approximating
density $f_n^*$ is only the first step in the estimation procedure. One still needs to
consider an effective procedure, whereby the optimal function can be obtained, at
least in the limit of an infinite amount of data. Assuming the estimation is based on
the finite data set $\{\mathbf{x}_i\}_{i=1}^{N}$, and denoting the estimated density by $\hat{f}_{n,N}$, the minimal
requirement (referred to as consistency) is that $\hat{f}_{n,N} \to f_n^*$ as $N \to \infty$, where
the convergence takes place in some well defined probabilistic sense. Since we are
interested in this paper in convergence rates, we will in fact make use of stronger
results [4] which actually characterize the rates at which the above convergence
takes place (see section 4).

In summary then, the basic issue we address in this work is related to the relationship
between the approximation and estimation errors and (i) the dimension
of the data, $d$, (ii) the sample size, $N$, and (iii) the complexity of the model class
parameterized by $n$.

3  Preliminaries 

In order to discuss approximation of densities we must define an appropriate distance
measure, $d(f, g)$, between densities $f$ and $g$. A commonly used measure of
discrepancy between densities is the so-called Kullback-Leibler (KL) divergence,
given by $d_K(f \| g) = \int f \log(f/g)$. As is obvious from the definition, the KL divergence
is not a true distance function since it is not symmetric. To circumvent this problem
one often resorts to an alternative definition of distance, namely the squared
Hellinger distance $d_H^2(f, g) = \int (\sqrt{f} - \sqrt{g})^2$, whose square root $d_H$ can be shown to be a true
metric and is particularly useful for problems of density estimation.
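The two discrepancy measures can be compared numerically. The sketch below evaluates both by a simple Riemann sum for a pair of one-dimensional Gaussians (an arbitrary choice made for this example), showing the asymmetry of the KL divergence and the standard inequality $d_H^2(f, g) \leq d_K(f \| g)$.

```python
import numpy as np

def gaussian_density(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

x = np.linspace(-30.0, 30.0, 600001)
dx = x[1] - x[0]
f = gaussian_density(x, 0.0, 1.0)      # hypothetical "true" density
g = gaussian_density(x, 1.0, 2.0)      # hypothetical model density

def kl(p, q):
    """Kullback-Leibler divergence d_K(p || q) = int p log(p / q), by Riemann sum."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask])) * dx

def hellinger_sq(p, q):
    """Squared Hellinger distance d_H^2(p, q) = int (sqrt p - sqrt q)^2."""
    return np.sum((np.sqrt(p) - np.sqrt(q)) ** 2) * dx

print(kl(f, g), kl(g, f), hellinger_sq(f, g))
```

Swapping the arguments changes the KL value but leaves the Hellinger value unchanged.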

Using the results of [2] concerning the method of sieves we conclude that under
appropriate conditions on $\phi$, the target density $f$ belongs to the closure of the
convex hull of the set of basis densities $\Phi = \{\phi(\cdot, \theta)\}$. The question arises, however,
as to how many terms are needed in the convex combination in order to achieve an
approximation error smaller than some arbitrary $\epsilon > 0$. The answer to this question
can be obtained using a remarkable lemma of Maurey which is proved for example
in [1].

4  Main  Results 

Using the results of Maurey referred to above, it can be shown that given any $\epsilon > 0$
one can construct a convex combination of densities, $\phi \in \Phi$, in such a way that
the total error between an arbitrary density and the model is smaller than $\epsilon$. We
consider now the problem of estimating a density function from a sequence of
$d$-dimensional samples, $\{\mathbf{x}_i\}$, $i = 1, 2, \ldots, N$, which will be assumed throughout to
be independent and identically distributed according to $f(\mathbf{x})$. As in eq. (1) we let
$n$ denote the number of components in the convex combination. The total number
of parameters will be denoted by $m$, which in the problem studied here is $O(nd)$.
In the remainder of this section we consider the problem of estimating the parameters
of the density through a specific estimation scheme, namely maximum
likelihood, corresponding to the optimization problem $\hat{\theta}_{n,N} = \arg\max_\theta L(\mathbf{x}^N; \theta)$,
where $L(\mathbf{x}^N; \theta)$ is the likelihood function, $L(\mathbf{x}^N; \theta) = \prod_{k=1}^{N} f_n(\mathbf{x}_k)$, $\mathbf{x}^N = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$
and $f_n(\mathbf{x}) = \sum_{i=1}^{n} \alpha_i \phi(\mathbf{x}; \theta_i)$. We denote the value of $f_n$ evaluated at the maximum
likelihood estimate by $\hat{f}_{n,N}$. Now, for a fixed value of $n$, the finite mixture
model, $f_n$, may not be sufficient to approximate the density $f$ to the required
accuracy. Thus, the model for finite $n$ falls into the so-called class of misspecified
models [4] and the procedure of maximizing $L$ should more properly be referred
to as quasi maximum likelihood estimation. Thus, $\hat{\theta}_{n,N}$ should be referred to as
the quasi maximum likelihood estimator. Since the data are assumed to be i.i.d.,
it is clear from the strong law of large numbers that $\frac{1}{N} \log L(\mathbf{x}^N; \theta) \to E \log f_n(\mathbf{x})$
almost surely as $N \to \infty$, where the expectation is taken with respect to the true
(but unknown) density, $f(\mathbf{x})$, generating the examples. From the trivial equality
$E \log f_n(\mathbf{x}) = -d_K(f \| f_n) + E \log f(\mathbf{x})$ we see that the maximum likelihood estimator
$\hat{\theta}_{n,N}$ is asymptotically given by $\theta_n^*$, where $\theta_n^* = \arg\min_\theta d_K(f \| f_n)$. For further
discussion see [4].
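A minimal sketch of quasi maximum likelihood estimation for a misspecified mixture is given below. It is an illustration under stated assumptions: the paper does not specify an optimization algorithm, so the EM iteration is swapped in as one common way to climb the likelihood; the basis is taken to be one-dimensional Gaussians; and the true density is a uniform distribution, which no finite Gaussian mixture reproduces exactly. All names and settings are hypothetical.

```python
import numpy as np

def gaussian_density(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def log_likelihood(x, alphas, means, stds):
    """log L(x^N; theta) = sum_k log f_n(x_k) for the mixture f_n."""
    dens = sum(a * gaussian_density(x, m, s) for a, m, s in zip(alphas, means, stds))
    return np.sum(np.log(dens))

def em_step(x, alphas, means, stds):
    """One EM update; the log-likelihood does not decrease under this step."""
    weighted = np.array([a * gaussian_density(x, m, s)
                         for a, m, s in zip(alphas, means, stds)])
    r = weighted / weighted.sum(axis=0)           # responsibilities r[i, k]
    nk = r.sum(axis=1)
    new_alphas = nk / x.size                      # stays nonnegative, sums to one
    new_means = (r @ x) / nk
    new_stds = np.sqrt((r * (x[None, :] - new_means[:, None]) ** 2).sum(axis=1) / nk)
    return new_alphas, new_means, new_stds

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=500)              # samples from a compactly supported density

alphas = np.full(3, 1.0 / 3.0)                    # n = 3 components
means = np.array([-0.5, 0.0, 0.5])
stds = np.full(3, 0.5)

history = [log_likelihood(x, alphas, means, stds)]
for _ in range(50):
    alphas, means, stds = em_step(x, alphas, means, stds)
    history.append(log_likelihood(x, alphas, means, stds))
print(history[0] / x.size, history[-1] / x.size)
```

The printed values are the normalised log-likelihood $\frac{1}{N} \log L$ before and after fitting, the quantity whose large-$N$ limit is discussed above.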

Now, the quantity of interest in density estimation is the distance between the
true density, $f$, and the density obtained from a finite sample of size $N$. Using the
previous notation and the triangle inequality for the metric $d(\cdot, \cdot)$ we have
$d(f, \hat{f}_{n,N}) \leq d(f, f_n^*) + d(f_n^*, \hat{f}_{n,N})$. This inequality stands at the heart of the derivation which
follows. We will show that the first term, namely the approximation error, is small.
This demonstration utilizes Maurey's lemma as well as several inequalities. In order
to evaluate the second term, the estimation error, we make use of the results of
White [4] concerning the asymptotic distribution of the quasi maximum likelihood
estimator $\hat{\theta}_{n,N}$. The splitting of the error into two terms in the triangle inequality
is closely related to the expression of the mean squared error in regression as the
sum of the bias (related to the approximation error) and the variance (akin to the
estimation error).

As mentioned above, Maurey's lemma provides us with an existence proof, in the
sense that there exists a parameter value $\theta_n^0$ such that the error of the combination
(1) is smaller than $c/n$. Since we are dealing here with a specific estimation scheme,
namely maximum likelihood, which asymptotically approaches a particular parameter
value $\theta_n^*$, the question we ask, however, is whether the parameter $\theta_n^*$ obtained
through the maximum likelihood procedure also gives rise to an approximation error
of the same order as that of $\theta_n^0$. The answer to this question is affirmative, as
we demonstrate in the next theorem, which is the first main result of this section.

Theorem 1 (Approximation error) Given appropriate assumptions, the Hellinger
distance between the true density $f$ and the density $f_n^*$ minimizing the Kullback-Leibler
divergence is bounded as follows: $d_H^2(f, f_n^*) \le \epsilon + C_{\mathcal{F},\Phi}/n$, where
$C_{\mathcal{F},\Phi}$ is a constant depending on the class of target densities $\mathcal{F}$ and the
family of basis densities $\Phi$, and $\epsilon$ is an arbitrary precision constant.

Proof (sketch) The proof follows by a series of inequalities relating the Kullback-Leibler
divergence and the Hellinger distance. By bounding the KL divergence from
above and below by the Hellinger distance and utilizing the fact that the maximum
likelihood estimator minimizes the KL divergence we conclude the proof. □

We note that the arbitrary precision constant $\epsilon$ appearing in the theorem can be
eliminated by restricting attention to densities belonging to the information closure,
defined as $\bar{\mathcal{G}} = \{f \in \mathcal{F} \mid \inf_{g \in \mathcal{G}} d_K(f \| g) = 0\}$, where
$\mathcal{G} = \bigcup_n \mathcal{G}_n$ and $\mathcal{G}_n$ comprises all convex combinations of $n$ terms as in (1).
We will restrict ourselves in the sequel to this subspace. We stress that the main point of
Theorem 1 is the following. While Maurey's lemma guarantees the existence of a parameter
value $\theta_n^0$ and a corresponding function which lies within a distance of $O(1/n)$ from $f$,
it is not clear a priori that $f_n^*$, evaluated at the quasi maximum likelihood estimate,
$\theta_n^*$, is also within the same distance from $f$. Theorem 1 establishes this fact.

Up to now we have been concerned with the first part of the triangle inequality. In
order to bound the estimation error resulting from the maximum likelihood method,
we need to consider now the second term in the same inequality. To do so we will need
to make use of a lemma of White [4], which characterizes the asymptotic distribution
of the estimator $\hat{\theta}_{n,N}$ obtained through the quasi maximum likelihood procedure.
A quantity of interest, which will be used below, is
$C(\theta) = A(\theta)^{-1} B(\theta) A(\theta)^{-1}$, where
$A(\theta) = E[\nabla \nabla^T \log f_n^{\theta}(x)]$ and
$B(\theta) = E[(\nabla \log f_n^{\theta}(x)) (\nabla \log f_n^{\theta}(x))^T]$, with the
expectations taken with respect to the true density $f$, and the gradient operator
$\nabla$ represents differentiation with respect to $\theta$. White's lemma then proceeds to
give sufficient conditions so that the distribution of the quasi maximum likelihood
estimator is asymptotically normal.
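
As an illustration only, the matrix $C(\theta) = A^{-1} B A^{-1}$ can be estimated by replacing the expectations with sample averages of per-example scores and Hessians. The sketch below uses only the Python standard library; the function names (`mat_mul`, `mat_inv`, `sandwich_covariance`) and the input layout are assumptions made for this example and do not come from the paper or from [4].

```python
from typing import List

Matrix = List[List[float]]


def mat_mul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two small dense matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def mat_inv(a: Matrix) -> Matrix:
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(a)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def sandwich_covariance(scores: List[List[float]],
                        hessians: List[Matrix]) -> Matrix:
    """Empirical C = A^{-1} B A^{-1}.

    scores[t]   : gradient of log f(x_t; theta) with respect to theta
    hessians[t] : Hessian of log f(x_t; theta) with respect to theta
    Sample means stand in for the expectations under the true density.
    """
    n_samples = len(scores)
    p = len(scores[0])
    A = [[sum(h[i][j] for h in hessians) / n_samples for j in range(p)]
         for i in range(p)]
    B = [[sum(s[i] * s[j] for s in scores) / n_samples for j in range(p)]
         for i in range(p)]
    A_inv = mat_inv(A)
    return mat_mul(mat_mul(A_inv, B), A_inv)
```

When the model is correctly specified, $B = -A$ and the product collapses to the inverse Fisher information; under misspecification the two matrices differ, which is why the sandwich form is kept.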

Finally, we will make use of the Fisher information matrix defined with respect to
the density $f_n^*$, which we shall refer to as the pseudo-information matrix, given by
$I^* = E^*[\nabla \log f_n^*(x) \nabla \log f_n^*(x)^T]$. The expectation $E^*$ is taken with
respect to $f_n^*$, the density evaluated at $\theta = \theta_n^*$. Denoting expectations over the
data (according to the true density $f$) by $E_D$, we have:

Theorem 2 (Error bound) For sample size $N$ sufficiently large, and given appropriate
smoothness assumptions (see [4]), the expected estimation error,
$E_D\, d_H^2(f, \hat{f}_{n,N})$, obtained from the quasi maximum likelihood estimator,
$\hat{f}_{n,N}$, is bounded as follows:
$E_D\, d_H^2(f, \hat{f}_{n,N}) \le O(C_{\mathcal{F},\Phi}/n) + O(m^*/N)$, where $d$ denotes the
data dimension, and $m^* = \mathrm{Tr}(C^* I^*)$ with $C^*$ and $I^*$ given above.

Proof (sketch) The initial step in the proof is to expand $d_H^2(f_n^*, \hat{f}_{n,N})$ to a
first order Taylor series with remainder. Some simple algebraic manipulations yield an
approximation in terms of the quasi maximum likelihood estimator and the pseudo-information
matrix. Taking expectation with respect to the data we obtain the
bound $O(m^*/N)$, utilizing the asymptotic results of White (1982). The final derivation
follows by the triangle inequality and the approximation bound. □

If we take $n, N \to \infty$ so that $(n/N) \to 0$, the matrix $C^*$ converges to the inverse of
the 'true' density Fisher information matrix, which we shall denote by $I^{-1}(\theta)$, and
the pseudo-information matrix, $I^*$, converges to the Fisher information $I(\theta)$. This
argument follows immediately from Theorem 2, which ensures the convergence of
the misspecified model to the 'true', underlying density. Therefore their product
converges to the identity matrix, with a dimension of the parameter vector
$m = n(d+2)$. The bound on the estimation error will therefore be given asymptotically
by $E_D\, d_H^2(f, \hat{f}_{n,N}) \le O(C_{\mathcal{F},\Phi}/n) + O(m/N)$, which is valid in the limit
($N \to \infty$, $n \to \infty$, $n/N \to 0$). The optimal complexity index $n$ may be obtained from
the asymptotic convergence rate bound, and is given by
$n_{opt} = (C_{\mathcal{F},\Phi} N/(d+2))^{1/2}$, where $d$ is the dimension of the data in the
sample space.
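
The trade-off between the two terms can be checked numerically. The sketch below is an illustration under explicit assumptions: the constants hidden in the $O(\cdot)$ terms are set to one, `c` stands in for the approximation constant, and the function names are hypothetical. Minimising $c/n + n(d+2)/N$ over $n$ gives $n = (cN/(d+2))^{1/2}$.

```python
import math


def error_bound(n: float, N: int, d: int, c: float) -> float:
    """Approximation term c/n plus estimation term n(d+2)/N.

    All constants hidden in the O(.) notation are assumed equal to one.
    """
    return c / n + n * (d + 2) / N


def optimal_components(N: int, d: int, c: float) -> float:
    """Minimiser of error_bound over n, obtained by setting the derivative to zero."""
    return math.sqrt(c * N / (d + 2))


if __name__ == "__main__":
    N, d, c = 10_000, 2, 4.0
    n_star = optimal_components(N, d, c)
    print("n_opt:", n_star, "bound at n_opt:", error_bound(n_star, N, d, c))
```

At the minimiser both terms are equal, so the bound itself decays like $N^{-1/2}$ for fixed $d$ in this simplified setting.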

We remark that the parameter $m^*$ may be interpreted as the effective number of
parameters of the model, under the misspecification of finite $n$. This parameter
correlates the misspecified model's generalized information matrix $C^*$ with the
pseudo-information matrix related to the density $f_n^*$, so that the effect of
misspecification results in a modification in the number of effective parameters.

5  Discussion 

We have considered in this paper the problem of estimating a density function over a
compact domain X. Specifically, the problem is phrased in the language of mixture
models, for which a great many theoretical and practical results are available.
Moreover, one can immediately utilize the powerful EM algorithm for estimating
the parameters. Finally, we comment that the method we used, namely combining
approximation error bounds with White's framework for misspecified models, gave
rise to lower estimation bounds than those obtained so far for function estimation.
We believe that this result is not restricted to density estimation, and can be directly
applied to function estimation using, for example, least-squares estimation.

We had earlier constructed neural networks which are capable of providing optimal approximation
rates for smooth target functions. The activation functions evaluated by the principal elements
of these networks were infinitely many times differentiable. In this paper, we prove that the
parameters of any network with these two properties must satisfy certain lower bounds. Our results
can also be thought of as providing a rudimentary test for the hypothesis that the unknown target
function belongs to a Sobolev class.

1  Introduction 

The notion of a generalized translation network was introduced by Girosi, Jones and
Poggio [4] (generalized regularization networks in their terminology). Let $1 \le d \le s$
be integers, and $\phi : \mathbb{R}^d \to \mathbb{R}$ be a fixed function. A generalized translation
network with $n$ neurons evaluates a function of the form

$$\sum_{k=1}^{n} a_k \phi(A_k x + b_k), \qquad (1)$$

where $x \in \mathbb{R}^s$, the $a_k$'s are real numbers, the $b_k$'s are vectors in $\mathbb{R}^d$, and the
$A_k$'s are $d \times s$ real matrices. This mathematical model includes the traditional neural
networks, where $d = 1$, as well as the radial basis function networks, where $d = s$, the
$A_k$'s are all equal to the $s \times s$ identity matrix, and $\phi$ is a radial function. Girosi,
Jones, and Poggio have discussed the importance of the more general networks for
applications in computer graphics, robotics, control, image processing, etc., as well as
emphasized the need for a theoretical investigation of the capabilities of these networks
for function approximation.
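
To make the form (1) concrete, the following minimal sketch evaluates a generalized translation network for a given input. It is an illustration only: the function name `translation_network`, the list-based data layout, and the two example activation functions are assumptions chosen for this example.

```python
import math
from typing import Callable, List, Sequence


def translation_network(x: Sequence[float],
                        coeffs: Sequence[float],
                        matrices: Sequence[List[List[float]]],
                        shifts: Sequence[Sequence[float]],
                        phi: Callable[[List[float]], float]) -> float:
    """Evaluate sum_k a_k * phi(A_k x + b_k).

    x        : input vector in R^s
    coeffs   : the n real coefficients a_k
    matrices : the n matrices A_k, each d x s (list of d rows)
    shifts   : the n vectors b_k in R^d
    phi      : activation function from R^d to R
    """
    total = 0.0
    for a_k, A_k, b_k in zip(coeffs, matrices, shifts):
        z = [sum(A_k[i][j] * x[j] for j in range(len(x))) + b_k[i]
             for i in range(len(b_k))]
        total += a_k * phi(z)
    return total


def logistic(z: List[float]) -> float:
    """Squashing function for the classical case d = 1."""
    return 1.0 / (1.0 + math.exp(-z[0]))


def gaussian(z: List[float]) -> float:
    """Radial activation for the case d = s with identity matrices."""
    return math.exp(-sum(v * v for v in z))
```

With `d = 1` and `logistic` this reduces to a one-hidden-layer sigmoidal network; with `d = s`, identity matrices and `gaussian` it is a radial basis function network centred at `-b_k`.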

One important reason for using generalized translation networks for function
approximation is to obtain a concrete, trainable model for the typically unknown
target function. Some of the features required of the model are the following. Of
course, the model should approximate the target function within a given margin
of error, utilizing as few neurons as possible. It is also desirable that the function
evaluated by the model be smoother than the target function. Moreover, it is also
expected that the parameters $a_k$, $A_k$, $b_k$ of the model remain bounded as the margin
of error approaches zero. In this paper, we prove that these goals are incompatible
with each other: the parameters of a model that evaluates a good approximation,
with $\phi$ smoother than the target function, must tend to infinity as the margin of
error tends to zero. In order to motivate this result further, we first review certain
known "positive" results.

It is well known [1], [2], [5], [6] that an arbitrary continuous function of $s$ variables
can be approximated arbitrarily closely on an arbitrary compact subset of $\mathbb{R}^s$ by
neural networks. A similar result was established in [8] for the case of generalized
translation networks. A deeper problem in this context is to determine the number
of neurons required to approximate all functions in a given function class within
a given margin of error. Equivalently, one seeks to estimate the degree of approximation
of the target function in terms of the number of neurons, $n$. Since the
target function is typically not known in advance, it is customary to assume that
it belongs to some known function class. A simple assumption is that the function
possesses continuous derivatives up to order $r$, where $r \ge 1$ is some integer. It is
well known [12] that for any function satisfying this condition, there is an algebraic
polynomial of degree not exceeding $m$, which gives a uniform approximation to this
function on $[-1,1]^s$ with an order of accuracy $O(m^{-r})$. In terms of the number $n$
of parameters involved, this order is $O(n^{-r/s})$. According to a result of DeVore,
Howard, and Micchelli [3], this is asymptotically the best order of approximation
that can possibly be achieved for the entire class of target functions, using any
"reasonable" approximation process involving $n$ parameters. It is an open problem
to determine whether the same degree of approximation can be achieved with generalized
translation networks. One expects that the actual degree of approximation
should depend upon certain growth and smoothness characteristics of the activation
function $\phi$.

This author and Micchelli investigated this problem in detail in [10], starting with
the case when both the target function and the activation function are $2\pi$-periodic.
When $\phi$ is a $2\pi$-periodic function, we were able to approximate the trigonometric
monomials by generalized translation networks, with a bound on the accuracy of
approximation in terms of the trigonometric degree of approximation of $\phi$. This
turned out to be a very fruitful idea, enabling us to establish a connection between
the degree of approximation provided by generalized translation networks on one
hand, and the degree of trigonometric polynomial approximation to the target function
and to the activation function $\phi$ on the other hand. The general theorem was
applied to the case when $\phi$ is not periodic, establishing degree of approximation
theorems for a very wide class of activation functions. As far as we are aware, our
estimates on the degree of approximation by radial basis functions were the first
of their kind, in that the estimates were in terms of the number of function evaluations,
rather than a scaling factor. These results were announced in [11]. In [10],
we constructed networks to provide an optimal recovery of functions, as well as
networks to provide simultaneous approximation of a function and its derivatives.
The idea was also applied in [9] to obtain certain dimension-independent bounds.
Both in [9] and [10], we give explicit constructions of the networks. The results
indicated that both the growth and smoothness of the activation function play a
role in the complexity problem.

In [7], we concentrated on activation functions that are infinitely often differentiable
in an open set, and for which there is at least one point in this open set at which
none of the partial derivatives is zero. Using the ideas in the paper [6] of Leshno, Lin,
Pinkus, and Schocken, we proved that generalized translation networks with such
activation functions provide the optimal degree of approximation for all smooth
functions. We also obtained estimates for the approximation of analytic functions.
The activation functions to which our results are applicable include the squashing
function, $(1 + e^{-x})^{-1}$, when $d = 1$, and the Gaussian function, the thin plate
splines, and multiquadrics, when $1 \le d \le s$. We give explicit constructions, and
the matrices $A_k$ and "thresholds" $b_k$ in the networks thus obtained are uniformly
bounded, independently of the degree of approximation desired. Unfortunately, the
coefficients $a_k$ in the networks may grow exponentially fast as the desired degree of
approximation tends to zero.


Mhaskar:  On  Smooth  Activation  Functions 




In this paper, we demonstrate that this phenomenon cannot be avoided if the activation
function is a bounded analytic function in a poly-strip; the coefficients and
the matrices cannot all be bounded independently of the desired degree of approximation.
This fact persists even if $\phi$ satisfies less stringent conditions. In particular,
all the special functions $\phi$ mentioned above necessarily have this drawback. For the
sake of simplicity, we have presented our results in the context of uniform approximation.
They are equally valid when the approximation is considered in an $L^p$
space.

2  Main  Results 

Let $k, m \ge 1$ be integers, and $B$ be a closed subcube of $\mathbb{R}^m$ (not necessarily
compact). The class of all functions $f : \mathbb{R}^m \to \mathbb{R}$ having continuous and bounded
partial derivatives of order up to (and including) $k$ on $B$ will be denoted by $C_B^k$.
In this section, we prove that if the activation function $\phi \in C^{\ell}_{\mathbb{R}^d}$ for some integer
$\ell \ge 1$, and the target function $f$ is not infinitely often differentiable on $[-1,1]^s$,
then the coefficients $a_k$ in any generalized translation network of the form (1) that
provides a uniform approximation of $f$ on $[-1,1]^s$ must satisfy certain lower bounds.
These lower bounds will be obtained in terms of the norms of the matrices $A_k$. If
$x = (x_1, \ldots, x_m) \in \mathbb{R}^m$, we define

$$|x|_m := \max_{1 \le j \le m} |x_j|. \qquad (2)$$

If $d, s \ge 1$ are integers, and $A$ is a $d \times s$ matrix, we define its norm by

$$\|A\| := \max_{|x|_s \le 1} |Ax|_d. \qquad (3)$$

In the sequel, $c, c_1, \ldots$ will denote positive constants, which may depend upon fixed
quantities, such as $\phi$, $d$, and $s$, and other indicated parameters only.
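
The norm in (3) is the matrix norm induced by the maximum norm (2). It is a standard fact that the maximum is attained at a vector with entries $\pm 1$, so the norm equals the largest absolute row sum. The helper names below are illustrative only.

```python
from typing import List, Sequence


def max_norm(x: Sequence[float]) -> float:
    """Maximum norm |x|_m = max_j |x_j| as in (2)."""
    return max(abs(v) for v in x)


def induced_norm(A: List[List[float]]) -> float:
    """Norm (3) of a d x s matrix: the largest absolute row sum."""
    return max(sum(abs(v) for v in row) for row in A)


def network_matrix_bound(matrices: Sequence[List[List[float]]]) -> float:
    """Largest induced norm over the matrices of a network (hypothetical helper)."""
    return max(induced_norm(A) for A in matrices)
```

For example, `induced_norm([[1, -2], [3, 4]])` is governed by the second row, whose absolute entries sum to 7.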

We  prove  the  following  theorem. 


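Theorem 1 Let $1 \le d \le s$, $r \ge 1$ be integers, and $\alpha$, $\epsilon$ be positive real numbers.
Suppose that $\phi \in C^{\ell}_{\mathbb{R}^d}$, $f : [-1,1]^s \to \mathbb{R}$, and $f \notin C^{r}_{[a,b]^s}$ for some
$[a,b] \subset (-1,1)$. Suppose that for every integer $n \ge 1$, there exists a generalized
translation network

$$\mathcal{N}_n(x) := \sum_{k=1}^{n} a_{k,n} \phi(A_{k,n} x + b_{k,n}), \qquad (4)$$

such that

$$|f(x) - \mathcal{N}_n(x)| \le c n^{-\alpha}, \qquad x \in [-1,1]^s, \qquad (5)$$

where $c$ is a positive constant that may depend upon $f$, $\phi$, $d$, $s$, and $\alpha$ but is
independent of $n$. Then there exists a subsequence $\Lambda$ of integers and a positive constant
$c_1$ depending on $f$, $\phi$, $d$, $s$, $\alpha$, and $\epsilon$ such that

$$\sum_{k=1}^{n} |a_{k,n}| \ge c_1 \frac{n^{\alpha(\ell - r - \epsilon)/(r+\epsilon)}}{M_n^{\ell}}, \qquad n \in \Lambda, \qquad (6)$$

where

$$M_n := \max_{1 \le k \le n} \|A_{k,n}\|. \qquad (7)$$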


We recall that one of the objectives of approximation is to obtain approximants
which are smoother than the original function. The theorem is therefore most interesting
if $\phi$ has more derivatives than $f$. In this case, if the matrices $A_{k,n}$ in
approximating networks are uniformly bounded, then the sum of the absolute values
of the coefficients must tend to infinity. The smoother the activation function $\phi$, the
faster is this growth. Thus, the goals of smoothing and having bounded parameters
are not compatible.

Another way to interpret this theorem is that if, for a function $f$, a sequence of
approximating networks of the form (4) can be found to yield the order of approximation
as in (5), but with the growth of the matrices $A_{k,n}$ and the coefficients $a_{k,n}$
controlled so that (6) is not satisfied, then $f$ must have continuous partial derivatives
of order at least $r$. Thus, for an infinitely many times differentiable activation
function $\phi$, we have a "converse theorem": the existence of networks with properly
controlled growth for the matrices and coefficients implies a smoothness of the target
function, which is unknown to start with. We observe that in the case of the
neural networks ($d = 1$) constructed in [10], the "weights" $A_{k,n}$ are $O(n)$ and an
upper estimate on the sum of the absolute values of the coefficients is also known.
The networks of [10] can therefore be used to verify the accuracy of the starting
hypothesis about the smoothness class to which the target function belongs.

The  ideas  behind  the  proof  of  Theorem  1  lead  us  to  the  following  corollary,  which 
gives  sharper  lower  bounds  than  in  (6)  in  the  case  when  the  activation  function  is 
analytic. 

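Corollary 2 Let $\delta > 0$ be a real number, and let $\phi : \mathbb{R}^d \to \mathbb{R}$ have an extension to the
poly-strip

$$\{z \in \mathbb{C}^d : |\mathrm{Im}\, z_j| \le \delta,\ j = 1, \ldots, d\}$$

as a bounded, analytic function. Then the estimate (6) can be replaced by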

SKnl  >  (^l  +  ^j  ,  n€A,  (8) 

where  m 

Proof of Theorem 1. Let $n, m \ge 1$ be integers. We observe that the function $\phi$
has continuous and bounded partial derivatives of order $\ell$ on $\mathbb{R}^d$. Therefore, the
Jackson theorem for algebraic polynomial approximation [12], §5.3.11, shows that
there exist polynomials $P_{k,m}$, each of coordinatewise degree at most $m$, such that

$$|\phi(A_{k,n} x + b_{k,n}) - P_{k,m}(x)| \le c M_n^{\ell} m^{-\ell}, \qquad x \in [-1,1]^s,\ k = 1, \ldots, n. \qquad (9)$$

Writing $Q_m := \sum_{k=1}^{n} a_{k,n} P_{k,m}$, we see that $Q_m$ is a polynomial of coordinatewise
degree at most $m$. Moreover, from (5) and (9), we obtain

$$|f(x) - Q_m(x)| \le c \left\{ n^{-\alpha} + M_n^{\ell} m^{-\ell} \sum_{k=1}^{n} |a_{k,n}| \right\}, \qquad x \in [-1,1]^s,\ m, n = 1, 2, \ldots. \qquad (10)$$

If possible, let (6) be not true. Then, choosing $n$ to be the smallest integer exceeding
$m^{(r+\epsilon)/\alpha}$,

$$|f(x) - Q_m(x)| \le c m^{-(r+\epsilon)}, \qquad x \in [-1,1]^s,$$

for all sufficiently large integers $m$. In view of the converse theorems for algebraic
polynomial approximation, [12] §6.3.4, this implies that $f \in C^{r}_{[a,b]^s}$ for every
$[a,b] \subset (-1,1)$. This contradiction completes the proof of the theorem. □

The proof of Corollary 2 is similar, except that one uses Bernstein's estimates
similar to [12] §5.4.1, instead of the Jackson-type estimates.

3 Conclusions

We have discussed our previous work on the degree of approximation by generalized
translation networks. It is shown here that the goals of having a smooth
approximation, and bounded coefficients and weights are incompatible when the
target function to be approximated fails to have at least as many continuous partial
derivatives as the activation function. Our theorem can also be thought of as
providing a rudimentary test for the hypothesis that the unknown target function
belongs to a Sobolev class.

This paper proposes a new technique called Regularisation by Convolution to improve the
generalisation of GaRBF networks. The technique is based on the convolution after training of GaRBF
networks with gaussian filters and, consequently, is independent of the learning stage and al¬

1 Introduction

Nowadays generalisation and regularisation are two of the most challenging topics
in neural computation. By generalisation it is generally understood that, from a
given training set consisting of input-output observations $\{x_i, y_i\}$, $i = 1, \ldots, n$,
of an underlying event $F^0$ such that $F^0(x_i) = y_i$, it is desired to construct an
estimated map $F$ which, for a new test set of input observations $\{x_j\}$, will provide
a good prediction for the unobserved output observations $\{y_j\}$.

One way of achieving generalisation, known as regularisation or smoothing, is to
find the mapping $F$ by means of a neural network, subject to some constraints on
the solution. One possible constraint is to limit the number of units in the neural
network, or to employ pruning techniques during learning in order to limit the
degrees of freedom of $F$ and thus avoid overfitting of the observations (see [1] for a
survey). A second class of regularisation techniques involves the determination of a
mapping $F$ in the $d$-dimensional Hilbert space $H$ of functions with continuous first
derivatives and square integrable second derivatives which minimises

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - F(x_i))^2 \qquad (1)$$

subject to the regularisation condition

$$\int (F^{(2)}(x))^2 \, dx \le \kappa \qquad (2)$$

for different values of a regularisation parameter $\kappa$ (see [2] and references contained
therein). Equation (1) is known as the Mean Squared Error (MSE) of the training
set and (2) as the regularisation constraint, which bounds the energy of the
second derivative of the mapping $F$. If $\kappa$ is made large then $F$ will just interpolate
the training data and, conversely, if it is chosen very small then the mapping,
although smooth, will not represent the underlying event $F^0$. A good regularisation is
achieved by adapting $\kappa$ to reach a suitable trade-off between the fitting and the
smoothness of the solution. Normally this is achieved by choosing different values
for $\kappa$, training under these $\kappa$ constraints, and testing the generalisation error results
on a test set by cross-validation. Although this solution may generalise well, it is
computationally expensive and needs retraining for each of the possible values of $\kappa$.
Moreover, the final results may be biased by the capacity of the neural network or
the training rule to escape from local minima. In this paper, we propose a new
regularisation technique based on convolution after training a Gaussian Radial Basis
Function (GaRBF) network with gaussian filters. This technique, which does not
need retraining, consists in essence of the following steps:


■  Train a GaRBF network with a large number of units.

■  Convolve the network with gaussian filters of different widths and normalise
the network so that the final $L_1$-norm of the convolved F(x) remains the same
(this implicitly leads to a reduction of the energy of the second derivative of
F(x)).

■  Verify the generalisation performance of the convolved GaRBF networks by
cross-validation, and retain the best solution.

The paper is organised as follows: Section 2 presents the mathematical principles
of this technique. Section 3 describes the main algorithm for regularisation and a
binary technique for searching for the best gaussian filter. Section 4 gives an
experimental evaluation of the regularisation technique on the regression of a synthetic
problem.

2  GaRBF  Network  Convolution  with  Gaussian  Filters 

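GaRBF networks are defined as a linear combination of Gaussian units. Let this be
denoted by

$$F(x) = \sum_{i=1}^{N} a_i f_i(x), \qquad (3)$$

where $N$ is the number of GaRBF units of general expression
$f_i(x) = e^{-\|x - m_i\|^2/\sigma_i^2}$, with amplitude $a_i \in \mathbb{R}$, centre $m_i \in \mathbb{R}^d$ and
width $\sigma_i \in \mathbb{R}$. Let $g(x, \sigma_f)$ be a Gaussian filter given by
$g(x, \sigma_f) = e^{-\|x\|^2/\sigma_f^2}$, with $\sigma_f$ as width. We first define the integral
convolution in the Hilbert space $H$ of these two functions, $f_i(x)$ and $g(x, \sigma_f)$, as

$$h(\tau) = \int_{-\infty}^{\infty} f_i(x)\, g(\tau - x, \sigma_f)\, dx. \qquad (4)$$

Since GaRBFs are bounded and absolutely integrable on $H$, the computation of
their convolution may more easily be done by means of the convolution property,
$\mathcal{F}\{h\} = \mathcal{F}\{f_i\} \cdot \mathcal{F}\{g\}$, where $\mathcal{F}\{\cdot\}$ is the Fourier transform. The Fourier
transforms of $f_i$ and $g$ respectively are

$$\mathcal{F}\{f_i\}(\omega) = \sigma_i^d (\sqrt{\pi})^d\, e^{-\sigma_i^2 \|\omega\|^2/4}\, e^{-\mathrm{j}\, \omega \cdot m_i} \qquad (5)$$

and

$$\mathcal{F}\{g\}(\omega) = \sigma_f^d (\sqrt{\pi})^d\, e^{-\sigma_f^2 \|\omega\|^2/4}. \qquad (6)$$

Finally, from the inverse Fourier transform of their product, a new GaRBF of the same
location but different amplitude and width is obtained,

$$h(\tau) = \left[ \frac{\sigma_i^2 \sigma_f^2}{\sigma_i^2 + \sigma_f^2} \right]^{d/2} (\pi)^{d/2}\, e^{-\|\tau - m_i\|^2/(\sigma_i^2 + \sigma_f^2)}. \qquad (7)$$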

The integral convolution of a GaRBF network $F(x)$ as defined in (3) and a gaussian
filter $g(x, \sigma_f)$ is a straightforward consequence of equation (7), since

$$\int_{-\infty}^{\infty} F(x)\, g(\tau - x, \sigma_f)\, dx = \sum_{i=1}^{N} a_i \int_{-\infty}^{\infty} f_i(x)\, g(\tau - x, \sigma_f)\, dx \qquad (8)$$

and is equal to a new GaRBF network with the same number of units, located at the
same positions but with different amplitude and width. In the particular context
of regularisation by convolution, it is of great importance to keep the convolved
function $h(\tau)$ as close as possible to the original $F(x)$ except at those points where
high frequencies have been filtered. This may be achieved by requiring the $L_1$-norm
of the GaRBF network to remain constant. A good alternative is to calculate the
$L_1$-norm of the gaussian filter, as follows,

$$\int_{-\infty}^{\infty} e^{-\|x\|^2/\sigma_f^2}\, dx = (\pi)^{d/2} \sigma_f^d \qquad (9)$$

and then convolve the network with a normalised filter, which leads to the same
results as the normalisation of the GaRBF network. Thus, the $L_1$-normalised
convolution $F(x) \otimes g(x, \sigma_f)$ of a GaRBF network $F(x)$ and a gaussian filter $g(x, \sigma_f)$
which retains the same norm as the original network may then be stated as

$$F(x) \otimes g(x, \sigma_f) = \frac{\int_{-\infty}^{\infty} F(x)\, g(\tau - x, \sigma_f)\, dx}{(\pi)^{d/2} \sigma_f^d}. \qquad (10)$$
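
For a one-dimensional input, the normalised convolution has a simple closed form: each unit keeps its centre, its width becomes $(\sigma_i^2 + \sigma_f^2)^{1/2}$, and its amplitude is scaled by $\sigma_i/(\sigma_i^2 + \sigma_f^2)^{1/2}$. The sketch below illustrates this for the one-dimensional case only and assumes units of the form $a\, e^{-(x-m)^2/\sigma^2}$; the tuple layout and function names are choices made for this example.

```python
import math
from typing import List, Tuple

Unit = Tuple[float, float, float]  # (amplitude a, centre m, width sigma)


def evaluate(units: List[Unit], x: float) -> float:
    """Evaluate a 1-D GaRBF network sum_i a_i * exp(-(x - m_i)^2 / sigma_i^2)."""
    return sum(a * math.exp(-((x - m) / s) ** 2) for a, m, s in units)


def convolve_garbf(units: List[Unit], sigma_f: float) -> List[Unit]:
    """Closed-form convolution with an L1-normalised Gaussian filter of width sigma_f.

    Centres are unchanged; widths grow and amplitudes shrink so that the
    integral of each unit (a * sigma * sqrt(pi)) is preserved.
    """
    smoothed = []
    for a, m, s in units:
        s_new = math.sqrt(s * s + sigma_f * sigma_f)
        smoothed.append((a * s / s_new, m, s_new))
    return smoothed


if __name__ == "__main__":
    net = [(1.0, 0.0, 0.5), (-0.5, 1.0, 0.3)]
    for width in (0.1, 0.4):
        print(width, evaluate(convolve_garbf(net, width), 0.3))
```

Because the result is again a GaRBF network, filters of several widths can be tried without retraining, which is the property the technique relies on.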


3  The  Regularising  Technique 

This section describes the regularising technique based on the convolution of GaRBF
units with gaussian filters, developed in the previous section. The technique is to be
applied to a pre-trained GaRBF network as defined in equation (3), and uses the
Root Mean Squared (RMS) error obtained on a cross-validation set after regularisation
as a performance criterion. The best regularising gaussian filter is found by
binary search under the assumption that the RMS error has no local minima in
the search space. Reasonable bounds for the binary search are easily obtained
from the bounded space on which the underlying function $F^0$ lies. Thus the minimal
bound for the filter width is equal to zero and the maximal bound is equal to the
maximal distance between two points of the bounded space $H$.

The structure of the related regularisation algorithm and its initialisation is shown
below as pseudo-code.

■  Initial parameters and data

-  Neural Network: $F(x) = \sum_{i=1}^{N} a_i f_i(x)$

-  Cross Validation set: $CV = \{x_j, y_j\}$, $j = 1, \ldots, n$

-  Gaussian Filter: $g(x, \sigma) = e^{-\|x\|^2/\sigma^2}$

-  Bounds of Gaussian filter widths and their average:

   $\sigma_{min} = 0$
   $\sigma_{max} = \max \left( \sum_{k=1}^{d} (x_k - x'_k)^2 \right)^{1/2}$, $\forall x, x' \in CV$
   $\sigma_{avg} = (\sigma_{max} + \sigma_{min})/2$

-  RMS errors obtained on the CV data set after convolving $F(x)$ with
   $g(x, \sigma_{min})$, $g(x, \sigma_{avg})$ and $g(x, \sigma_{max})$:
   $RMS_{min}$, $RMS_{avg}$ and $RMS_{max}$

-  Stop threshold for the binary search: $\epsilon$

■  Regularising Algorithm

Loop until $|RMS_{max} - RMS_{min}| < \epsilon$

1.  Calculate the $RMS_1$ and $RMS_2$ errors obtained after convolution of
    $F(x)$ with $g(x, \sigma)$, for widths $\sigma_1$ and $\sigma_2$, where
    $\sigma_1 = (\sigma_{min} + \sigma_{avg})/2$
    $\sigma_2 = (\sigma_{avg} + \sigma_{max})/2$

2.  If $RMS_1 < RMS_2$ then
      $\sigma_{max} = \sigma_{avg}$ & $\sigma_{avg} = \sigma_1$
      $RMS_{max} = RMS_{avg}$ & $RMS_{avg} = RMS_1$
    else
      $\sigma_{min} = \sigma_{avg}$ & $\sigma_{avg} = \sigma_2$
      $RMS_{min} = RMS_{avg}$ & $RMS_{avg} = RMS_2$
end loop
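
The pseudo-code above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: `rms_error` is a hypothetical callable that returns the cross-validation RMS error of the network after convolution with a filter of the given width (a caller would normally treat width zero as "no filtering"), and `max_iter` is an added safeguard that is not part of the original algorithm.

```python
from typing import Callable, Tuple


def search_filter_width(rms_error: Callable[[float], float],
                        sigma_max: float,
                        eps: float = 1e-6,
                        max_iter: int = 200) -> Tuple[float, float]:
    """Interval-halving search for the filter width with the lowest RMS error.

    Assumes, as the text does, that the RMS error has no local minima
    on [0, sigma_max]. Returns (sigma_avg, RMS_avg) at termination.
    """
    lo, hi = 0.0, sigma_max
    avg = 0.5 * (lo + hi)
    rms_lo, rms_avg, rms_hi = rms_error(lo), rms_error(avg), rms_error(hi)
    for _ in range(max_iter):
        if abs(rms_hi - rms_lo) < eps:
            break
        s1 = 0.5 * (lo + avg)   # midpoint of the lower half
        s2 = 0.5 * (avg + hi)   # midpoint of the upper half
        r1, r2 = rms_error(s1), rms_error(s2)
        if r1 < r2:
            # keep the lower half: old average becomes the new upper bound
            hi, rms_hi = avg, rms_avg
            avg, rms_avg = s1, r1
        else:
            # keep the upper half: old average becomes the new lower bound
            lo, rms_lo = avg, rms_avg
            avg, rms_avg = s2, r2
    return avg, rms_avg
```

Each iteration halves the search interval at the cost of two error evaluations, and no retraining is needed because only the filter width changes. Since the comparison uses only the two quarter points, the search relies on the stated assumption about the shape of the error curve.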

4  Numerical  Results 

In order to illustrate the performance of Regularisation by Convolution, we have
applied the algorithm to Wahba's synthetic problem [2]. This problem consists of
the regression of a noisy function generated synthetically according to the model
$F^0(x) = 4.26\,(e^{-x} - 4e^{-2x} + 3e^{-3x}) + \nu$, where $\nu$ is normally distributed random
noise with zero mean and standard deviation 0.2. In the original problem, 100 noisy
observations were generated and used as learning set for the training of a sigmoid
feed-forward neural network. Wahba performed regularisation during training using
equation (2). The $\kappa$ value which provided the lowest RMS error was obtained
by leave-one-out cross-validation. Thus retraining was needed for each different
choice of $\kappa$.

Figure 1 Original Wahba's function (dash-dot). Neural Network regression
(solid). CV observations (circles). a) LRAN regression with RMS error equal to
0.3011 on the CV data set. b) The same LRAN after regularisation by convolution
with the best Gaussian filter obtained by binary search ($\sigma_{avg} = 0.1709$) with RMS
error equal to 0.2116.
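
A data set of this kind can be generated with a few lines of Python. The sketch is illustrative: the sampling interval `[0, x_max]`, the evenly spaced inputs and the random seed are assumptions made here, since the text does not specify them.

```python
import math
import random
from typing import List, Tuple


def wahba(x: float) -> float:
    """Noise-free test function 4.26 * (exp(-x) - 4 exp(-2x) + 3 exp(-3x))."""
    return 4.26 * (math.exp(-x) - 4.0 * math.exp(-2.0 * x) + 3.0 * math.exp(-3.0 * x))


def make_dataset(n: int = 100,
                 noise_std: float = 0.2,
                 x_max: float = 3.0,
                 seed: int = 0) -> Tuple[List[float], List[float]]:
    """Return n evenly spaced inputs on [0, x_max] and noisy targets.

    x_max and seed are assumptions of this sketch, not values from the paper.
    """
    rng = random.Random(seed)
    xs = [x_max * k / (n - 1) for k in range(n)]
    ys = [wahba(x) + rng.gauss(0.0, noise_std) for x in xs]
    return xs, ys
```

Calling `make_dataset` twice with different seeds gives independent training and cross-validation sets drawn from the same model.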

In our case we applied Regularisation by Convolution to the same problem after
training a Limited Resource Allocating network (LRAN) by the F-Projections
learning rule [3]. Training was only performed once, after which the network had
learnt the underlying problem with over-fitting of the noisy data, as shown in
Figure 1(a). The original training data was interpolated by a factor of ten to provide
a sufficiently representative training set for the noisy problem, and a LRAN
with 200 units (twice the number of training samples) was used to ensure learning
with overfitting. We used an extra set of 100 samples as Cross-Validation set CV to
calculate the best regularising filter. The stop threshold $\epsilon$ was fixed at 10~^. Figure
1(b) illustrates the final result.

5  Summary 

This paper has shown how a technique based on a convolution operator can be
successfully applied to regularise Gaussian RBF networks and thereby improve their
generalisation performance. The advantage of this technique over other
approaches is that it is independent of training processes and algorithms.
In this paper we study the problem of the prediction of autonomous continuous-time dynamical
systems from discrete-time measurements of the state variables. We show that the predictor of such
a system needs to be an invertible map from the state-space into itself. The problem then becomes
one of how to approximate invertible maps. We show that standard approximation schemes do
not guarantee the property of invertibility. We therefore propose a new approximation scheme
based on the composition of invertible basis functions which preserves invertibility. This approach
can be cast in a Lie algebraic framework where the approximation is based on the use of the
Baker-Campbell-Hausdorff formula. The method is implemented in a neural-like form. We also
present a more general implementation which we call “MLP in dynamics space”.

Keywords: System prediction, Lie algebras, Baker-Campbell-Hausdorff formula.

1  Introduction 

We will study here the problem of the prediction of nonlinear dynamical systems
from a set of sampled measurements of the state of the system. The nonlinear
systems which we will consider are governed by a multidimensional ordinary
differential equation. Although the dynamics of the system is continuous in time, the
samples that we will use for prediction are the state variables measured at discrete
time-intervals only. We are therefore led to postulate a discrete-time form for our
prediction: $x(k+1) = F(x(k))$. In the first section, we will show, using the fact
that the underlying dynamics of the system is time-continuous, that the function
F used for prediction should be invertible. The problem then becomes that of building
an approximation scheme specifically designed for invertible maps. The main
drawback of standard approximation schemes is that they are based on summing
basis functions to build an approximation. However, this operation does not
preserve invertibility. Using the operation of composition instead of addition, we are
able to guarantee invertibility. We will thus build approximation schemes based
on composition of invertible basis functions. In the second section, we will analyze
those compositions within the framework of Lie algebra theory. Starting from a
method from numerical analysis we will be able to derive a new approach to the
prediction of nonlinear systems. In section three, this new method will be presented
in a neural-like form and illustrated by an example. We will also present a general
model for approximation which we call “MLP in dynamics space”. Finally, we will
conclude.

1.1  Invertibility  of  Dynamical  Systems 

The systems considered here are defined by an ordinary differential equation (ODE)
on some n-dimensional manifold $\mathcal{M}$. We follow the differential geometric description
of dynamical systems [2] and assume the standard smoothness assumptions. For
each initial condition $x_0$, we can solve the initial value problem. We then collect all
those solutions in one function $\Phi$ which gives the solution of the ODE as a
function of the initial condition:

$\begin{cases} \dot{x} = f(x(t)) \\ x(0) = x_0 \end{cases} \;\Rightarrow\; x(t) = \Phi(x_0, t)$ (1)

$\Phi$ is called the flow of the dynamical system. The flow is a remarkable object which
possesses the group property. We refer the reader to [4] for a general presentation
of the group theoretic approach to dynamical systems. If we assume, for simplicity,
that the flow is defined for all times, the group property writes as follows:

$\Phi : \mathcal{M} \times \mathbb{R} \to \mathcal{M}$
$\Phi(x, t + s) = \Phi(\Phi(x, t), s) \quad \forall t, s \in \mathbb{R}$ (2)

As we will see, this gives the flow the structure of a group. We define the time-t
map as

$\varphi^t(x_0) = \Phi(x_0, t)$ (3)

This map $\varphi^t$ associates to any initial condition its image under the flow of the
differential equation after a given time t. The map $\varphi^t$ is a map from the manifold
$\mathcal{M}$ into itself. Using the group property of the flow, we can deduce the properties
of $\varphi$:

$\varphi^t \circ \varphi^s(x_0) = \Phi(\Phi(x_0, s), t) = \Phi(x_0, t + s) = \varphi^{t+s}(x_0)$ (composition)
$\varphi^0(x_0) = \Phi(x_0, 0) = x_0 = I(x_0)$ (identity) (4)
$\varphi^t \circ \varphi^{-t}(x_0) = \Phi(x_0, 0) = x_0 = I(x_0)$, so $\varphi^{-t} = (\varphi^t)^{-1}$ (inverse)

Moreover, if f is $C^r$ then $\varphi^t$ is $C^r$. Hence, since $\varphi^t$ is a smooth invertible map with a
smooth inverse, $\varphi^t$ is a diffeomorphism. Furthermore, the properties of $\varphi$ show that
$\{\varphi^t\}$ is a group. We therefore say that a flow $\Phi(x, t)$ is a one-parameter group of
diffeomorphisms $\varphi^t$.
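The group properties can be checked numerically on any ODE whose flow is known in closed form. The sketch below uses the scalar logistic equation $\dot{x} = x(1 - x)$ as a stand-in; this equation is chosen here purely for illustration and does not appear in the paper.

```python
import math


def logistic_flow(x0, t):
    """Closed-form flow Phi(x0, t) of dx/dt = x (1 - x), for 0 < x0 < 1."""
    e = math.exp(t)
    return x0 * e / (1.0 - x0 + x0 * e)


x0, t, s = 0.2, 0.7, -1.3

# Composition: following the flow for t and then for s equals following it for t + s.
composed = logistic_flow(logistic_flow(x0, t), s)
direct = logistic_flow(x0, t + s)

# Identity and inverse: time 0 does nothing, time -t undoes time t.
identity = logistic_flow(x0, 0.0)
round_trip = logistic_flow(logistic_flow(x0, t), -t)
```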

1.2  Dynamical  System  Prediction 

Even if we assume that the underlying dynamics of the system is governed by an
ODE, the system is generally observed only at discrete time-intervals. So, if we
choose a sampling period $\Delta t$, we can assume a functional relationship between the
current state and the next sample. The system becomes of the form:

$x((k+1).\Delta t) = F(x(k.\Delta t)), \quad x(k) \in \mathcal{M}$ (5)

And, from what we have shown before, we see that F is a diffeomorphism as

$F = \varphi^{\Delta t}$ (6)

The following figure (Fig. 1) summarizes those observations. On the left we consider
a square grid of initial conditions. We let each point of that grid follow its
trajectory for a time-interval $\Delta t$. After this time, all the points of the grid have
moved and the shape of the grid has changed. The diffeomorphism $\varphi^{\Delta t}$ is that
transformation from the manifold (here, $\mathbb{R}^2$) into itself which maps the square grid
on the left into the deformed grid on the right. One can see that this transformation
is smooth and invertible, hence a diffeomorphism. So, the question of predicting a
continuous-time system observed at discrete time-intervals becomes one of how we
can approximate those diffeomorphisms which arise from the sampling of an ODE.
The basic idea is that the set of all $C^r$ diffeomorphisms is also a group. In a group,
the natural operation is the composition of the elements of the group. While most
standard approximation schemes are based on the summation of some basis functions,
this operation does not preserve the property of being a diffeomorphism. On
the other hand, composition does preserve this property, so we propose to build an
approximation scheme based on composing basis functions instead of adding them.

Figure 1 Diffeomorphic transformation of a square grid after one time-step.

Figure 2 Implementation of the composition network.
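A one-dimensional toy case makes the contrast between summation and composition concrete. In the sketch below, `f` and `g` are two hypothetical diffeomorphisms of the real line chosen for illustration; their sum fails to be injective while their composition remains strictly monotonic.

```python
import numpy as np


def f(x):
    """Diffeomorphism of the real line: derivative 1 + 3 x^2 is always positive."""
    return x + x ** 3


def g(x):
    """Diffeomorphism of the real line: derivative is the constant -2."""
    return -2.0 * x


def is_strictly_monotonic(values):
    """True if the sampled values are strictly increasing or strictly decreasing."""
    d = np.diff(values)
    return bool(np.all(d > 0) or np.all(d < 0))


x = np.linspace(-2.0, 2.0, 401)
sum_is_invertible = is_strictly_monotonic(f(x) + g(x))    # x^3 - x has three zeros
composition_is_invertible = is_strictly_monotonic(f(g(x)))  # -2x - 8x^3 is decreasing
```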


2  Lie  Algebra  Theory 

Lie algebra theory has a long history in physics, mostly in the areas of classical
mechanics [1] and partial differential equations [8]. It is also an essential part of
nonlinear system theory [5]. Our approach to system prediction can be best cast in
this framework. A Lie algebra is a vector space where we define a supplementary
operation: the bracket [., .] of two elements of the algebra. The bracket is an
operation which is bilinear, antisymmetric and satisfies the Jacobi identity [8]. In the
case where the Lie algebra is the vector space of all vector fields, the time-t map
plays a very important role. If the vector field is A, we define the exponential
map as follows: $\exp(t.A) = \varphi_A^t$, the time-t map of the flow of A. This notation is an
extension of the case where the vector field is linear in space and where the solution
is given by exponentiation of the matrix. The exponential is thus a mapping from the
manifold into itself. And this map depends on the underlying vector field A and
changes over time. From here on, the product of exponentials will denote the
composition of the maps. One can then define the Lie bracket by

$[A, B] = \left.\frac{\partial^2}{\partial s\,\partial t}\right|_{t=s=0} \exp(-s.B).\exp(-t.A).\exp(s.B).\exp(t.A)$ (7)

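In the case where the manifold is $\mathbb{R}^n$, the bracket can be further particularized to

$[A, B]_j = \sum_{i=1}^{n} \left( A_i \frac{\partial B_j}{\partial x_i} - B_i \frac{\partial A_j}{\partial x_i} \right)$ (8)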


2.1 The Baker-Campbell-Hausdorff Formula

The Baker-Campbell-Hausdorff (BCH) formula gives an expansion for the product
of the exponentials of two elements of the Lie algebra, see [6]:

$\exp(A).\exp(B) = \exp\left(A + B + \frac{1}{2}[A, B] + \frac{1}{12}\left([A, [A, B]] + [B, [B, A]]\right) + \ldots\right)$ (9)

The problem of predicting a dynamical system with vector field X then becomes
that of building an approximation for $\exp(\Delta t.X)$ as we have that $x(k+1) =
\exp(\Delta t.X)[x(k)]$. This problem has recently been the focus of much attention in the
field of numerical analysis for the integration of Hamiltonian differential equations
[6] [7]. Suppose that the vector field X is of the form $X = A + B$ where we can
integrate A and B directly. We can use the BCH formula to produce a first-order
approximation to the exponential map:

BCH: $\exp(\Delta t.X) = \exp(\Delta t.A).\exp(\Delta t.B) + o(\Delta t^2)$ (10)

This is the essence of the method as it shows that one can approximate an
exponential map (that is, the map arising from the solution of an ODE) by composing
simpler maps. By repeated use of the BCH formula, we can show that the following
leapfrog scheme is second-order.

Leapfrog: $\exp(\Delta t.X) = \exp(\frac{\Delta t}{2}.A).\exp(\Delta t.B).\exp(\frac{\Delta t}{2}.A) + o(\Delta t^3)$ (11)

Using this leapfrog scheme as a basis element for further leapfrog schemes, Yoshida
[9] showed that it was possible to produce an approximation to $\exp(\Delta t.X)$ up to
any order. Forest and Ruth [3] showed that approximations could be built for more
than two vector fields. Combining the two we can state that it is possible to build an
approximation to the solution of a linear combination of vector fields as a product
of exponential maps:

$\exists\, w_{ij} : \exp(\Delta t.X) = \prod_{i}\prod_{j} \exp(w_{ij}.\Delta t.A_j) + o(\Delta t^{p+1})$ (12)
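The orders quoted for (10) and (11) can be observed numerically on a system whose pieces and whose full flow are all known exactly. The sketch below uses the harmonic oscillator $\dot{q} = p$, $\dot{p} = -q$, split into a drift field A and a kick field B; this test system and the function names are choices made for this illustration only.

```python
import numpy as np


def flow_a(state, t):
    """Exact flow of A: dq/dt = p, dp/dt = 0."""
    q, p = state
    return np.array([q + t * p, p])


def flow_b(state, t):
    """Exact flow of B: dq/dt = 0, dp/dt = -q."""
    q, p = state
    return np.array([q, p - t * q])


def exact_flow(state, t):
    """Exact flow of X = A + B (a rotation in the (q, p) plane)."""
    q, p = state
    c, s = np.cos(t), np.sin(t)
    return np.array([c * q + s * p, -s * q + c * p])


def first_order_step(state, h):
    """Composition of the two flows over a full step each, as in (10)."""
    return flow_b(flow_a(state, h), h)


def leapfrog_step(state, h):
    """Symmetric composition: half step of A, full step of B, half step of A."""
    return flow_a(flow_b(flow_a(state, 0.5 * h), h), 0.5 * h)


def one_step_error(step, h, state):
    return float(np.linalg.norm(step(state, h) - exact_flow(state, h)))


start = np.array([1.0, 0.0])
# Halving the step should divide the one-step error by about 4 and 8 respectively.
first_order_ratio = one_step_error(first_order_step, 0.1, start) / one_step_error(first_order_step, 0.05, start)
leapfrog_ratio = one_step_error(leapfrog_step, 0.1, start) / one_step_error(leapfrog_step, 0.05, start)
```

Both composed maps are invertible by construction, whatever the step size, because each factor is an exact flow.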


3  Network  Implementation 

Such an approximation scheme can be easily implemented as a multilayer network
as can be seen in the following figure (Fig. 2). The problem of predicting the system
can now simply be solved by minimizing the prediction error of the model by tuning
the weights $w_{ij}$ using some gradient-based optimization technique.

3.1  Example 

We will now present an example to illustrate how this method can be applied. We
will do this by looking at the Van der Pol oscillator. This system can be written as
a first-order system in state-space form which can be seen as a linear combination
of two vector fields $A_1$, $A_2$ which can be solved analytically:

$\begin{pmatrix} \dot{x} \\ \dot{y} \end{pmatrix} = \begin{pmatrix} y \\ -x \end{pmatrix} + \alpha \begin{pmatrix} 0 \\ (1 - x^2)\,y \end{pmatrix} = A_1 + \alpha.A_2$ (13)

One can then build a predictor according to the architecture just presented (Fig. 2),
simply choosing an appropriate number of layers. One then finds the appropriate
weights $w_{ij}$ by minimizing the error on the training set. One interesting property
of this method is that we are able to use the a priori knowledge we have about the
system. Such a feature is uncommon in the field of system prediction using neural
networks. This method allowed us to build a predictor for the Van der Pol oscillator
in the case where the parameter $\alpha$ was unknown. We also produced a similar result
for the Lorenz attractor.
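As an illustration of the building blocks involved, the sketch below composes the exact flows of the two vector fields of the Van der Pol system in a single leapfrog layer with fixed weights. It is not the trained network of the paper: the weights are simply those of scheme (11), the value $\alpha = 1$ is arbitrary, and a fine-step Runge-Kutta integration is used only as a reference.

```python
import numpy as np

ALPHA = 1.0  # arbitrary value for this sketch


def flow_a1(state, t):
    """Exact flow of A1 = (y, -x): a rotation."""
    x, y = state
    c, s = np.cos(t), np.sin(t)
    return np.array([c * x + s * y, -s * x + c * y])


def flow_a2(state, t):
    """Exact flow of A2 = (0, (1 - x^2) y): x is frozen, y grows exponentially."""
    x, y = state
    return np.array([x, y * np.exp((1.0 - x * x) * t)])


def predictor(state, h, alpha=ALPHA):
    """One leapfrog layer: exp(h/2.A1) . exp(alpha.h.A2) . exp(h/2.A1)."""
    state = flow_a1(state, 0.5 * h)
    state = flow_a2(state, alpha * h)
    return flow_a1(state, 0.5 * h)


def van_der_pol(state, alpha=ALPHA):
    x, y = state
    return np.array([y, -x + alpha * (1.0 - x * x) * y])


def rk4_reference(state, h, n_steps):
    """Classical fourth-order Runge-Kutta integration, used only for comparison."""
    for _ in range(n_steps):
        k1 = van_der_pol(state)
        k2 = van_der_pol(state + 0.5 * h * k1)
        k3 = van_der_pol(state + 0.5 * h * k2)
        k4 = van_der_pol(state + h * k3)
        state = state + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
    return state


start = np.array([1.0, 0.0])
predicted = start
for _ in range(100):                       # 100 steps of size 0.01, up to time 1
    predicted = predictor(predicted, 0.01)
reference = rk4_reference(start, 0.001, 1000)
prediction_error = float(np.linalg.norm(predicted - reference))

# The layer is invertible by construction: a step of -h undoes a step of h.
recovered = predictor(predictor(start, 0.01), -0.01)
```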

3.2  MLP  in  Dynamics  Space 

If we decide to approximate the vector field of our system by a multi-layer
perceptron, one can derive a composition network to implement this. The form of the
vector field is

$A(x) = \sum_{j} c_j\,\sigma(b_j^T x + b_{j0}) = \sum_{j} A^{c_j, b_j}(x)$ (14)

where $\sigma(x) = \tanh(x)$ and $c_j, b_j \in \mathbb{R}^n$, $j = 1, \ldots$

But the differential equation $\dot{x} = \sigma(x)$ can be solved explicitly in the one-dimensional
case. We can use this to explicitly integrate the multidimensional system $\dot{x} =
A^{c_j, b_j}(x)$ for any value of the parameters $c_j, b_j$. So, we can design a network of the
following form:

$x(k+1) = F(x(k)) = \exp(w_k.A^{c_k, b_k}) \circ \ldots \circ \exp(w_1.A^{c_1, b_1})(x(k))$ (15)

We call this an “MLP in dynamics space” as the MLP is implicitly used to parametrize
the vector field. Further work will be devoted to finding computationally efficient
methods for the training of such a network.
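One way to carry out the explicit integration of a single unit $\dot{x} = c\,\tanh(b^T x + b_0)$ is sketched below. The derivation is ours and is offered as an assumption rather than as the authors' implementation: the scalar $u = b^T x + b_0$ obeys $\dot{u} = k \tanh(u)$ with $k = b^T c$, so $\sinh u(t) = \sinh u(0)\,e^{kt}$, and x moves along the fixed direction c by $(u(t) - u(0))/k$. The parameter values are arbitrary.

```python
import numpy as np


def unit_flow(x0, t, c, b, b0):
    """Time-t map of dx/dt = c * tanh(b.x + b0), using the closed form above."""
    k = float(b @ c)
    u0 = float(b @ x0 + b0)
    if abs(k) < 1e-12:
        # u stays constant, so x moves at constant velocity c * tanh(u0).
        return x0 + c * np.tanh(u0) * t
    u_t = np.arcsinh(np.sinh(u0) * np.exp(k * t))
    return x0 + c * (u_t - u0) / k


def rk4_unit(x0, t, c, b, b0, n_steps=1500):
    """Numerical reference solution, used only to check the closed form."""
    h = t / n_steps
    x = x0.copy()

    def field(z):
        return c * np.tanh(b @ z + b0)

    for _ in range(n_steps):
        k1 = field(x)
        k2 = field(x + 0.5 * h * k1)
        k3 = field(x + 0.5 * h * k2)
        k4 = field(x + h * k3)
        x = x + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
    return x


c = np.array([1.0, -0.5])
b = np.array([0.3, 0.8])
b0 = 0.2
x0 = np.array([0.5, -1.0])

closed_form = unit_flow(x0, 1.5, c, b, b0)
numerical = rk4_unit(x0, 1.5, c, b, b0)
round_trip = unit_flow(closed_form, -1.5, c, b, b0)   # inverse map is the flow for -t
```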

4  Conclusions 

We have studied the problem of predicting an autonomous continuous-time
nonlinear dynamical system from discrete-time measurements of its state. We showed
that because the predictor must have the form of an invertible map, it is interesting
to have an approximation scheme which has this property. This can be achieved
by using compositions instead of additions in the approximation. The theory of Lie
algebras provided the framework to develop such a method. We showed that this
resulted in a multi-layer architecture and we illustrated it by a simple example.
Further, the method was extended to some form of “MLP in dynamics space”.
We present here a method for the study of stochastic neurodynamics in the master equation
framework. Our aim is to obtain a statistical description of the dynamics of fluctuations and correlations
of neural activity in large neural networks. We focus on a macroscopic description of the network
via a master equation for the number of active neurons in the network. We present a systematic
expansion of this equation using the “system size expansion”. We obtain coupled dynamical
equations for the average activity and for fluctuations around this average. These equations exhibit
non-monotonic approaches to equilibrium, as seen in Monte Carlo simulations.

Keywords: stochastic neurodynamics, master equation, system size expansion.

1  Introduction 

The correlated firing of neurons is considered to be an integral part of information
processing in the brain [12, 2]. Experimentally, cross-correlations are used to
study synaptic interactions between neurons and to probe for synchronous network
activity. In theoretical studies of stochastic neural networks, understanding
the dynamics of correlated neural activity requires one to go beyond the mean field
approximation that neglects correlations in non-equilibrium states [8, 6]. In other
words, we need to go beyond the simple mean-field approximation to study the
effects of fluctuations about average firing activities.

Recently,  we  have  analyzed  stochastic  neurodynamics  using  a  master  equation[5,  8]. 
A  network  comprising  binary  neurons  with  asynchronous  stochastic  dynamics  is 
considered,  and  a  master  equation  is  written  in  “second  quantized  form”  to  take 
advantage  of  the  theoretical  tools  that  then  become  available  for  its  analysis.  A 
hierarchy  of  moment  equations  is  obtained  and  a  heuristic  closure  at  the  level  of 
second  moment  equations  is  introduced.  Another  approach  based  on  the  master 
equation  via  path  integrals,  and  the  extension  to  neurons  with  a  refractory  state 
are  discussed  in  [9,  10]. 

In  this  paper,  we  introduce  another  master  equation  based  approach  to  go  beyond 
the  mean  field  approximation.  We  concentrate  on  the  macroscopic  behavior  of  a 
network  of  two-state  neurons,  and  introduce  a  master  equation  for  the  number  of 
active  neurons  in  the  network  at  time  t.  We  use  a  more  systematic  expansion  of 
the  master  equation  than  hitherto,  the  “system  size  expansion”  [11].  The  expansion 
parameter  is  the  inverse  of  the  total  number  of  the  neurons  in  the  network.  We 
truncate  the  expansion  at  second  order  and  obtain  an  equation  for  fluctuations 
about  the  mean  number  of  active  neurons,  which  is  itself  coupled  to  the  equation 
for the average number of active neurons at time t. These equations show
non-monotonic approaches to equilibrium values near critical points, a feature which is
not  seen  in  the  mean  field  approximation.  Monte  Carlo  simulations  of  the  master 
equation  itself  show  qualitatively  similar  non-monotonic  behavior. 
2 Master Equation and the System Size Expansion

We first construct a master equation for a network comprising N binary elements
with two states, “active” or firing and “quiescent” or non-firing. The transitions
between these states are probabilistic and we assume that the transition rate from
active to quiescent is a constant $\alpha$ for every neuron in the network. We do not
make any special assumption about network connectivity, but assume that it is
“homogeneous”, i.e., all neurons are statistically equivalent with respect to their
activities, which depend only on the proportion of active neurons in the network.
More specifically, the transition rate from quiescent to active is given as a function
$\phi$ of the number of active neurons in the network. Taking the firing time to be
about 2 ms, we have $\alpha \approx 500\,s^{-1}$. For the transition rate from quiescent to active,
the range of the function $\phi$ is $\approx 30 - 100\,s^{-1}$, reflecting empirically observed
firing-rates of cortical neurons. With these assumptions, one can write a master equation
as follows.

$-\frac{\partial}{\partial t} P_N[n, t] = \alpha\left(n P_N[n, t] - (n + 1) P_N[n + 1, t]\right)$
$\qquad + N\left(1 - \frac{n}{N}\right)\phi\left(\frac{n}{N}\right) P_N[n, t] - N\left(1 - \frac{n - 1}{N}\right)\phi\left(\frac{n - 1}{N}\right) P_N[n - 1, t],$ (1)

where $P_N[n, t]$ is the probability that the number of active neurons is n at time t.
(We absorbed the parameter representing total synaptic weight into the function
$\phi$.) This master equation can be deduced from the second quantized form cited
earlier, which will be discussed elsewhere. The standard form of this equation can
be rewritten by introducing the “step operator”, defined by the following action on
an arbitrary function of n:

$\mathcal{E} f(n) = f(n + 1), \quad \mathcal{E}^{-1} f(n) = f(n - 1)$ (2)

In effect, $\mathcal{E}$ and $\mathcal{E}^{-1}$ shift n by one. Using such step operators, Eq. (1) becomes

$\frac{\partial}{\partial t} P_N[n, t] = (\mathcal{E} - 1)\, r_n P_N[n, t] + (\mathcal{E}^{-1} - 1)\, g_n P_N[n, t],$ (3)

where $r_n = \alpha n$, and $g_n = (N - n)\,\phi(\frac{n}{N})$. This master equation is non-linear since
$g_n$ is a nonlinear function of n. Linear master equations, in which both $r_n$ and $g_n$
are linear functions of n, can be solved exactly. However, in general, non-linear
master equations cannot be solved exactly, so in our case, we seek an approximate
solution.
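For a small network the master equation (3) can be written as a linear system $dP/dt = QP$ and inspected directly. The sketch below builds Q from the rates $r_n$ and $g_n$; the function name and the choice of a constant $\phi$ are assumptions made for the check. With constant $\phi$ the equation is linear and each neuron switches independently, so the stationary distribution is binomial.

```python
import numpy as np
from math import comb


def master_generator(n_neurons, alpha, phi):
    """Matrix Q with dP/dt = Q P, where P[n] is the probability of n active neurons.

    phi: callable acting on an array of activity fractions n/N.
    """
    n = np.arange(n_neurons + 1)
    r = alpha * n                              # decay rate out of state n (n -> n - 1)
    g = (n_neurons - n) * phi(n / n_neurons)   # activation rate out of state n (n -> n + 1)
    q = np.zeros((n_neurons + 1, n_neurons + 1))
    for k in range(n_neurons + 1):
        q[k, k] = -(r[k] + g[k])               # total outflow from state k
        if k + 1 <= n_neurons:
            q[k, k + 1] = r[k + 1]             # inflow from state k + 1
        if k - 1 >= 0:
            q[k, k - 1] = g[k - 1]             # inflow from state k - 1
    return q


N, ALPHA, RATE = 20, 0.5, 0.3                  # small illustrative values
Q = master_generator(N, ALPHA, lambda x: np.full_like(x, RATE))

# Binomial distribution with activation probability RATE / (ALPHA + RATE).
p_on = RATE / (ALPHA + RATE)
binomial = np.array([comb(N, k) * p_on ** k * (1.0 - p_on) ** (N - k) for k in range(N + 1)])
```

The columns of Q sum to zero, which expresses conservation of total probability.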

We now expand the master equation to obtain approximate equations for the
stochastic dynamics of the network. We use the system size expansion, which is
closely related to the Kramers-Moyal expansion, to obtain the “macroscopic
equation” and time-dependent approximations to fluctuations about the solutions of
this equation. In essence, this method is a way to expand master equations in
powers of a small parameter, which is usually identified as the inverse size of the
system. Here, we identify the system size with the total number of neurons in a
network.

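We make a change of variables in the master equation given in (3). We assume that
fluctuations about the macroscopic value of n are of order $N^{1/2}$. In other words,
we expect that $P_N(n, t)$ will have a maximum around the macroscopic value of n
with a width of order $N^{1/2}$. Hence, we set

$n(t) = N\mu(t) + N^{1/2}\xi(t)$ (4)

where $\mu$ satisfies the macroscopic equation and $\xi$ is a new variable, whose distribution
$\Pi$ is equivalent to $P_N(n, t)$, i.e., $P_N(n, t) = \Pi(\xi, t)$. We expand the step operators
as:

$\mathcal{E} = 1 + N^{-1/2}\frac{\partial}{\partial \xi} + \frac{1}{2} N^{-1}\frac{\partial^2}{\partial \xi^2} + \ldots, \quad \mathcal{E}^{-1} = 1 - N^{-1/2}\frac{\partial}{\partial \xi} + \frac{1}{2} N^{-1}\frac{\partial^2}{\partial \xi^2} - \ldots$ (5)

as $\mathcal{E}$ shifts $\xi$ by $N^{-1/2}$. With this change of variables, the master equation is
given as follows:

$\frac{\partial \Pi}{\partial t} - N^{1/2}\frac{d\mu}{dt}\frac{\partial \Pi}{\partial \xi} = N\left(N^{-1/2}\frac{\partial}{\partial \xi} + \frac{1}{2} N^{-1}\frac{\partial^2}{\partial \xi^2} + \ldots\right)\left[\alpha(\mu + N^{-1/2}\xi)\,\Pi\right]$
$\qquad + N\left(-N^{-1/2}\frac{\partial}{\partial \xi} + \frac{1}{2} N^{-1}\frac{\partial^2}{\partial \xi^2} - \ldots\right)\left[\left(1 - (\mu + N^{-1/2}\xi)\right)\phi(\mu + N^{-1/2}\xi)\,\Pi(\xi, t)\right]$ (6)

Collecting terms, we obtain, to order $N^{1/2}$,

$\frac{d\mu}{dt} = -\alpha\mu + (1 - \mu)\,\phi(\mu)$ (7)

This is the macroscopic equation, which can be obtained also by using a mean-field
approximation to the master equation. We make $\mu$ satisfy this equation so that
terms of order $N^{1/2}$ vanish.

The next order is $N^0$, which gives a Fokker-Planck equation for the fluctuation $\xi$:

$\frac{\partial \Pi}{\partial t} = \frac{\partial}{\partial \xi}\left[\left(\alpha + \phi(\mu) - (1 - \mu)\,\phi'(\mu)\right)\xi\,\Pi\right] + \frac{1}{2}\frac{\partial^2}{\partial \xi^2}\left[\left(\alpha\mu + (1 - \mu)\,\phi(\mu)\right)\Pi\right]$ (8)

We note that this equation does not depend on the variable N, justifying our
assumption that fluctuations are of order $N^{1/2}$.

We now study the behavior of the equations obtained through the system size
expansion to the second order. From Eqs. (7) and (8), we obtain

$\frac{dx}{dt} = -\alpha x + (1 - x)\,\phi(x - \eta) + \eta\,(1 - x + \eta)\,\phi'(x - \eta)$ (9)

$\frac{d\eta}{dt} = -\alpha\eta - \eta\,\phi(x - \eta) + \eta\,(1 - x + \eta)\,\phi'(x - \eta)$ (10)

where $x = \langle n \rangle / N$ and $\eta = N^{-1/2}\langle \xi \rangle$. Equations (9) and (10) can be numerically
integrated. Some examples are shown in Figure 1(B). For comparison, we plot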
solutions  of  the  macroscopic  equations  with  the  same  parameter  sets  in  Figure 
1(A).  We  observe  a  physically  expected  bifurcation  into  an  active  network  state 
with  decreasing  a,  for  either  approximation.  However,  different  dynamics  are  seen 
as  one  approaches  bifurcation  points.  In  particular,  the  coupled-equations  exhibit 
a  non-monotonic  approach  to  the  limiting  value.  The  validity  of  the  system  size 
expansion  is  limited  to  the  region  not  close  to  the  bifurcation  point,  as  discussed 
in  the  last  section.  The  point  is  that  by  incorporating  higher  order  terms  into  the 
approximation,  we  can  extend  its  validity  to  a  domain  closer  to  the  bifurcation 
point  and  thereby  better  capture  stochastic  dynamical  behavior  of  such  networks. 
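A numerical integration of the coupled equations (9) and (10) can be sketched as follows. The activation function is not specified in this part of the text, so the logistic form of `phi` below, with the parameter values $\beta = 15$ and $\theta = 0.5$ borrowed from the figure captions, is purely an assumption of this sketch, as are the step size and integration time.

```python
import math


def phi(x, beta=15.0, theta=0.5):
    """Hypothetical activation function: a logistic sigmoid (assumed, not from the text)."""
    return 1.0 / (1.0 + math.exp(-beta * (x - theta)))


def dphi(x, beta=15.0, theta=0.5):
    """Derivative of the assumed sigmoid."""
    p = phi(x, beta, theta)
    return beta * p * (1.0 - p)


def rhs(x, eta, alpha):
    """Right-hand sides of the coupled equations for the mean x and the fluctuation eta."""
    m = x - eta
    coupling = eta * (1.0 - x + eta) * dphi(m)
    dx = -alpha * x + (1.0 - x) * phi(m) + coupling
    deta = -alpha * eta - eta * phi(m) + coupling
    return dx, deta


def integrate(x, eta, alpha, dt=0.01, steps=5000):
    """Classical fourth-order Runge-Kutta integration of the coupled system."""
    for _ in range(steps):
        k1x, k1e = rhs(x, eta, alpha)
        k2x, k2e = rhs(x + 0.5 * dt * k1x, eta + 0.5 * dt * k1e, alpha)
        k3x, k3e = rhs(x + 0.5 * dt * k2x, eta + 0.5 * dt * k2e, alpha)
        k4x, k4e = rhs(x + dt * k3x, eta + dt * k3e, alpha)
        x += dt * (k1x + 2.0 * k2x + 2.0 * k3x + k4x) / 6.0
        eta += dt * (k1e + 2.0 * k2e + 2.0 * k3e + k4e) / 6.0
    return x, eta


# With eta = 0 initially, the system reduces to the macroscopic equation.
x_end, eta_end = integrate(0.5, 0.0, alpha=0.2)
residual = -0.2 * x_end + (1.0 - x_end) * phi(x_end)

# A small initial fluctuation couples the two equations.
x_fluct, eta_fluct = integrate(0.5, 0.01, alpha=0.2)
```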
Monte Carlo simulations [3] of the two dimensional network based on (1) with 2500
neurons with periodic boundary conditions were performed. The connectivity is
set up as follows: a neuron is connected to a specified number k of other neurons
chosen randomly from the network. The strength of connection is given by the
Poisson form:

Wij  =  (11) 


where $r_{ij}$ is the distance between two neurons, and $w_0$ and s are constants. We
show in Figure 2 the average behavior of x for (A) k = 200 (k/N = 0.08) and (B)
k = 15 (k/N = 0.006). The non-monotonic dynamics is more noticeable in the
low connectivity network. More quantitative comparisons between simulations and
theory will be carried out in the future. The qualitative comparison shown here,
however, indicates the need to model fluctuations of total activity near critical points
in order to capture the dynamics of sparsely connected networks. This is consistent
with our earlier investigations of a one-dimensional ring of neurons via a master
equation.


Figure 1 Comparison of solutions of (A) the macroscopic equation, and (B) Eqs. (9)
and (10). The parameters are set at $\beta = 15$, $\theta = 0.5$ and $\alpha$ = (a) 0.2, (b) 0.493,
and (c) 0.9. The initial conditions are x = 0.5, and $\eta$ = (A) 0.01, and (B) 0.


3  Discussion 

We have here outlined an application of the system size expansion to a master
equation for stochastic neural network activity. It produced a dynamical equation
for the fluctuations about mean activity levels, the solutions of which showed a
non-monotonic approach to such levels near a critical point. This has been seen in model
networks with low connectivity. Two issues raised by this approach require further
comment: (1) In this work we have used the number of neurons in the network
as an expansion parameter. Given the observation that the overall connectedness
affects the stochastic dynamics, a parameter representing the average connectivity
per neuron may be better suited as an expansion parameter. We note that this
parameter is typically small for biological neural networks. (2) There are many
studies of Hebbian learning in neural networks [1, 4, 7]. In such studies attempts
have also been made to incorporate correlations of neural activities. It is of interest
to see if we can formulate such attempts within the framework presented here.

Figure 2 Comparisons of Monte Carlo simulations of the master equation with
high (k = 200) and low (k = 15) connectivities per neuron. The parameters are set
at $\beta = 15$, $\theta = 0.5$, $w_0 = 100.0$, s = 3.0, and for (A) $\alpha$ = (a) 0.05, (b) 0.1, and
(c) 0.4, and for (B) $\alpha$ = (a) 0.001, (b) 0.05, and (c) 0.2. The initial condition is a
random configuration.
In the Bayesian framework, predictions for a regression problem are expressed in terms of a
distribution of output values. The mode of this distribution corresponds to the most probable
output, while the uncertainty associated with the predictions can conveniently be expressed in
terms of error bars given by the standard deviation of the output distribution. In this paper we
consider the evaluation of error bars in the context of the class of generalized linear regression
models. We provide insights into the dependence of the error bars on the location of the data
points and we derive an upper bound on the true error bars in terms of the contributions from
individual data points which are themselves easily evaluated.

1  Introduction 

Many  applications  of  neural  networks  are  concerned  with  the  prediction  of  one  or 
more  continuous  output  variables,  given  the  values  of  a  number  of  input  variables. 
As  well  as  predictions  for  the  outputs,  it  is  also  important  to  provide  some  measure 
of  uncertainty  associated  with  those  predictions. 

The  Bayesian  view  of  regression  leads  naturally  to  two  contributions  to  the  error 
bars.  The  first  arises  from  the  intrinsic  noise  on  the  target  data,  while  the  second 
comes  from  the  uncertainty  in  the  values  of  the  model  parameters  as  a  consequence 
of  having  a  finite  training  data  set  [1,  2].  There  may  also  be  a  third  contribution 
which  arises  if  the  true  function  is  not  contained  within  the  space  of  models  under 
consideration,  although  we  shall  not  discuss  this  possibility  further. 

In this paper we focus attention on a class of universal non-linear approximators
constructed from linear combinations of fixed non-linear basis functions, which we
shall refer to as generalized linear regression models. We first review the Bayesian
treatment of learning in such models, as well as the calculation of error bars [3].
Then, by considering the contributions arising from individual data points, we provide
insight into the nature of the error bars and their dependence on the location
of the data in input space. This in turn leads to the key result of the paper which is
an upper bound on the true error bars expressed in terms of the single-data-point
contributions. Our analysis is very general and is independent of the particular form
of the basis functions.

2  Bayesian  Error  Bars 

We are interested in the problem of predicting the value of a noisy output variable
t given the value of an input vector x. Throughout this paper we shall restrict
attention to regression for a single variable t since all of the results can be extended
in a straightforward way to multiple outputs. To set up the Bayesian formalism we
begin by defining a model for the distribution of t conditional on x. This is most
commonly chosen to be a Gaussian function in which the mean is governed by the
output $y(x; w)$ of a network model, where w is a vector of adaptive parameters
(weights and biases). Thus we have

$p(t|x, w) = \frac{1}{(2\pi\sigma_\nu^2)^{1/2}} \exp\left(-\frac{(y(x; w) - t)^2}{2\sigma_\nu^2}\right)$ (1)

where $\sigma_\nu^2$ is the variance of the distribution.

In the Bayesian framework, our state of knowledge of the weight values is expressed
in terms of a distribution function over w. This is initially set to some prior
distribution, from which a corresponding posterior distribution can be computed using
Bayes' theorem once we have observed the training data. A common choice of prior
is a Gaussian distribution of the form

$p(w) = \frac{|S|^{1/2}}{(2\pi)^{M/2}} \exp\left(-\frac{1}{2} w^T S w\right)$ (2)

where M is the total number of weight parameters, S is the inverse covariance
matrix of the distribution, and $|S|$ denotes the determinant of S. Since the
parameters in S control the distribution of other parameters they are often referred to as
hyperparameters. The noise variance $\sigma_\nu^2$ is commonly also called a hyperparameter
since, in a Bayesian framework, it can be treated using similar techniques to S.
Here we shall assume that the values of $\sigma_\nu^2$ and S are fixed and known.

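The training data set $D$ consists of $N$ pairs of input vectors $x^n$ and corresponding
target values $t^n$, where $n = 1, \ldots, N$. From this data set, together with the noise
model (1), we can construct the likelihood function given by

$$p(D \mid w) = \prod_{n=1}^{N} p(t^n \mid x^n, w) = \frac{1}{(2\pi\sigma_\nu^2)^{N/2}} \exp\left\{ -\frac{1}{2\sigma_\nu^2} \sum_{n=1}^{N} \left(y(x^n; w) - t^n\right)^2 \right\} \tag{3}$$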

We can then combine the likelihood function and the prior using Bayes' theorem to
obtain the posterior distribution of weights given by $p(w \mid D) = p(D \mid w)\, p(w) / p(D)$.
The predictive distribution of $t$ given a new input $x$ can then be written in terms
of the posterior distribution in the form

$$p(t \mid x, D) = \int p(t \mid x, w)\, p(w \mid D)\, dw \tag{4}$$

where $p(t \mid x, w)$ is given by (1).

Throughout  this  paper  we  consider  a  particular  class  of  non-linear  models  of  the 
form 

$$y(x; w) = \sum_{j=1}^{M} w_j \phi_j(x) = \phi^T(x)\, w \tag{5}$$

which we shall call generalized linear regression models. Here the $\phi_j(x)$ are a set
of fixed non-linear basis functions, with generally one of the basis functions $\phi_1 = 1$
so that $w_1$ plays the role of a bias parameter. Such models possess universal
approximation capabilities for reasonable choices of the $\phi_j(x)$, while having the
advantage of being linear in the adaptive parameters $w$.

Since (5) is linear in $w$, both the noise model $p(t \mid x, w)$ and the posterior distribution
$p(w \mid D)$ will be Gaussian functions of $w$. It therefore follows that, for a Gaussian
prior of the form (2), the integral in (4) will be Gaussian and can be evaluated
analytically to give a predictive distribution $p(t \mid x, D)$ which will be a Gaussian
function of $t$. The mean of this distribution is given by $y(x; w_{\mathrm{MP}})$ where $w_{\mathrm{MP}}$ is
found by minimizing the regularized error function

$$\frac{1}{2\sigma_\nu^2} \sum_{n=1}^{N} \left\{ y(x^n; w) - t^n \right\}^2 + \frac{1}{2} w^T S w \tag{6}$$

and is therefore given by the solution of the following linear equations

$$A\, w_{\mathrm{MP}} = \sigma_\nu^{-2}\, \Phi^T t \tag{7}$$

where $t$ is a column vector with elements $t^n$, and $A$ is the Hessian matrix given by

$$A = \frac{1}{\sigma_\nu^2} \sum_{n=1}^{N} \phi(x^n)\, \phi(x^n)^T + S = \frac{1}{\sigma_\nu^2} \Phi^T \Phi + S \tag{8}$$

and $\Phi$ is the $N \times M$ design matrix with elements $\Phi_{nj} = \phi_j(x^n)$. Solving for $w_{\mathrm{MP}}$
and substituting into (5) we obtain the following expression for the corresponding
network output

$$y_{\mathrm{MP}}(x) = \phi^T(x)\, w_{\mathrm{MP}} = \sigma_\nu^{-2}\, \phi^T(x)\, A^{-1} \Phi^T t \tag{9}$$

The covariance matrix for the posterior distribution $p(w \mid D)$ is given by the inverse
of the Hessian matrix. Together with (4) this implies that the total variance of the
output predictions is given by

$$\sigma_t^2(x) = \sigma_\nu^2 + \sigma_w^2(x) = \sigma_\nu^2 + \phi^T(x)\, A^{-1} \phi(x) \tag{10}$$

Here  the  first  term  represents  the  intrinsic  noise  on  the  target  data,  while  the  second 
term  arises  from  the  uncertainty  in  the  weight  values  as  a  consequence  of  having  a 
finite  data  set. 
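
As an illustration only, the quantities in (7) to (10) can be computed in a few lines. The sketch below assumes Gaussian basis functions plus a constant bias function, an isotropic prior $S = I$, and a noise variance of 0.01; these values and the helper names `gaussian_basis` and `fit_and_error_bars` are illustrative choices and are not taken from the paper.

```python
import numpy as np

def gaussian_basis(x, centres, width):
    """Design matrix: one constant (bias) column followed by Gaussian bumps."""
    x = np.atleast_1d(x).astype(float)
    bumps = np.exp(-0.5 * ((x[:, None] - centres[None, :]) / width) ** 2)
    return np.hstack([np.ones((x.size, 1)), bumps])

def fit_and_error_bars(x_train, t_train, x_test, centres, width, noise_var, S):
    """Return the predictive mean (9) and total variance (10) at x_test."""
    Phi = gaussian_basis(x_train, centres, width)            # N x M design matrix
    A = Phi.T @ Phi / noise_var + S                          # Hessian, Eq. (8)
    w_mp = np.linalg.solve(A, Phi.T @ t_train / noise_var)   # Eq. (7)
    Phi_test = gaussian_basis(x_test, centres, width)
    y_mp = Phi_test @ w_mp                                   # Eq. (9)
    # phi^T A^{-1} phi for every test point, without forming A^{-1} explicitly
    weight_var = np.einsum("ij,ij->i", Phi_test, np.linalg.solve(A, Phi_test.T).T)
    return y_mp, noise_var + weight_var                      # Eq. (10)

# Assumed toy setting (not the paper's experiment)
centres = np.linspace(0.0, 1.0, 30)
S = np.eye(centres.size + 1)            # assumed isotropic prior, bias included
noise_var = 0.01                        # assumed noise variance
x_train = np.array([0.3, 0.5])
t_train = np.array([1.0, -1.0])
x_test = np.array([0.3, 0.5, 0.9])
y_mp, total_var = fit_and_error_bars(x_train, t_train, x_test,
                                     centres, 0.07, noise_var, S)
```

Solving the linear system rather than inverting $A$ is a numerical convenience; it gives the same error bars as (10).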


3  An  Upper  Bound  on  the  Error  Bars 

We  first  consider  the  behaviour  of  the  error  bars  when  the  data  set  consists  of  a 
single  data  point.  As  well  as  providing  important  insights  into  the  nature  of  the 
error  bars,  it  also  leads  directly  to  an  upper  bound  on  the  true  error  bars. 

In the absence of data, the variance is given from (8) and (10) by

$$\sigma_t^2(x) = \sigma_\nu^2 + \phi^T(x)\, S^{-1} \phi(x) \tag{11}$$

where the second term, due to the prior, is typically much larger than the noise
term $\sigma_\nu^2$. If we now add a single data point located at $x^0$ then the Hessian becomes
$S + \sigma_\nu^{-2} \phi(x^0)\, \phi^T(x^0)$. To find the inverse of the Hessian we make use of the identity

$$(M + v v^T)^{-1} = M^{-1} - \frac{(M^{-1} v)(v^T M^{-1})}{1 + v^T M^{-1} v} \tag{12}$$

which is easily verified by multiplying both sides by $(M + v v^T)$. The variance at
an arbitrary point $x$ for a single data point at $x^0$ is then given by

$$\sigma_t^2(x) = \sigma_\nu^2 + C(x, x) - \frac{C(x, x^0)^2}{\sigma_\nu^2 + C(x^0, x^0)} \tag{13}$$

where we have defined the prior covariance function

$$C(x, x') \equiv \phi^T(x)\, S^{-1} \phi(x') \tag{14}$$

The first two terms on the right hand side of (13) represent the variance due to
the prior alone, and we see that the effect of the additional data point is to reduce
the variance from its prior value, as illustrated for a toy problem in Figure 1. From
(13) we see that the length scale of this reduction is related to the prior covariance
function $C(x, x')$.
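
For readers who want the intermediate step, the single-data-point variance can be obtained as follows. This is a sketch of the algebra, assuming the rank-one inverse identity $(M + vv^T)^{-1} = M^{-1} - M^{-1} v v^T M^{-1} / (1 + v^T M^{-1} v)$ with the choices $M = S$ and $v = \sigma_\nu^{-1} \phi(x^0)$, where $x^0$ denotes the location of the single data point:

$$A^{-1} = \left(S + \sigma_\nu^{-2} \phi(x^0) \phi^T(x^0)\right)^{-1} = S^{-1} - \frac{S^{-1} \phi(x^0)\, \phi^T(x^0) S^{-1}}{\sigma_\nu^2 + \phi^T(x^0) S^{-1} \phi(x^0)}$$

$$\phi^T(x) A^{-1} \phi(x) = C(x, x) - \frac{C(x, x^0)^2}{\sigma_\nu^2 + C(x^0, x^0)}, \qquad C(x, x') \equiv \phi^T(x) S^{-1} \phi(x')$$

Adding the noise term $\sigma_\nu^2$ to the second line gives the total variance for a single data point.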




If we evaluate $\sigma_t^2(x)$ in (13) at the point $x^0$ then we can show that the error
bars satisfy the upper bound $\sigma_t^2(x^0) \le 2\sigma_\nu^2$. Since the noise level is typically much
less than the prior variance level, we see that the error bars are pulled down very
substantially in the neighbourhood of the data point. Again, this is illustrated in
Figure 1.
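
The bound at the data point follows in one line. Writing $x^0$ for the location of the data point and $C_0 \equiv C(x^0, x^0) \ge 0$ (a shorthand introduced here only for this step), the single-data-point variance evaluated at $x = x^0$ is

$$\sigma_t^2(x^0) = \sigma_\nu^2 + C_0 - \frac{C_0^2}{\sigma_\nu^2 + C_0} = \sigma_\nu^2 + \frac{\sigma_\nu^2\, C_0}{\sigma_\nu^2 + C_0} < 2\sigma_\nu^2 ,$$

since the fraction $C_0 / (\sigma_\nu^2 + C_0)$ is always less than one. When $C_0 \gg \sigma_\nu^2$ the variance at the data point approaches $2\sigma_\nu^2$ from below.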

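We now extend this analysis to provide an upper bound on the error bars. Suppose
we have a data set consisting of $N$ data points (at arbitrary locations) and we add
an extra data point at $x^{N+1}$. Using (8) the Hessian $A_{N+1}$ for the $N+1$ data points
can be written in terms of the corresponding Hessian $A_N$ for the original $N$ data
points in the form

$$A_{N+1} = A_N + \sigma_\nu^{-2}\, \phi(x^{N+1})\, \phi^T(x^{N+1}) \tag{15}$$

Using the identity (12) we can now write the inverse of $A_{N+1}$ in the form

$$A_{N+1}^{-1} = A_N^{-1} - \frac{A_N^{-1} \phi(x^{N+1})\, \phi^T(x^{N+1}) A_N^{-1}}{\sigma_\nu^2 + \phi^T(x^{N+1}) A_N^{-1} \phi(x^{N+1})} \tag{16}$$

Substituting this result into (10) we obtain

$$\sigma_{N+1}^2(x) = \sigma_N^2(x) - \frac{\left[\phi^T(x) A_N^{-1} \phi(x^{N+1})\right]^2}{\sigma_\nu^2 + \phi^T(x^{N+1}) A_N^{-1} \phi(x^{N+1})} \tag{17}$$

From (8) we see that the Hessian $A_N$ is positive definite, and hence its inverse will
be positive definite. It therefore follows that the second term on the right hand side
of (17) cannot be positive, and so we obtain

$$\sigma_{N+1}^2(x) \le \sigma_N^2(x) \tag{18}$$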

This  represents  the  intuitive  result  that  the  addition  of  an  extra  data  point  cannot 
lead  to  an  increase  in  the  magnitude  of  the  error  bars.  Repeated  application  of  this 
result  shows  that  the  error  bars  due  to  a  set  of  data  points  will  never  be  larger  than 
the  error  bars  due  to  any  subset  of  those  data  points. 

It  can  also  be  shown  that  the  average  change  in  the  error  bars  resulting  from  the 
addition  of  an  extra  data  point  satisfies  the  bounds 

=  -  '’■at)®")]  >  (19) 


A  further  corollary  of  the  result  (18)  is  that,  if  we  consider  the  error  bars  due 
to  each  of  a  set  of  N  data  points  individually,  then  the  envelope  of  those  error 
bars  constitutes  an  upper  bound  on  the  true  error  bars.  This  is  illustrated  with  a 
toy  problem  in  Figure  1.  The  contributions  from  the  individual  data  points  are 
easily  evaluated  using  (13)  and  (14)  since  they  depend  only  on  the  prior  covariance 
function  and  do  not  require  evaluation  or  inversion  of  the  Hessian  matrix. 
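
The envelope property can be checked numerically. The following sketch assumes a setting loosely modelled on the caption of Figure 1 (30 Gaussian basis functions of width 0.07, data points at 0.3 and 0.5); the prior $S = I$, the noise variance 0.01, and all function names are assumptions made for illustration.

```python
import numpy as np

def basis(x, centres, width):
    """Rows are phi(x)^T for Gaussian basis functions (no bias term here)."""
    x = np.atleast_1d(x).astype(float)
    return np.exp(-0.5 * ((x[:, None] - centres[None, :]) / width) ** 2)

def single_point_variance(phi_x, phi_0, S_inv, noise_var):
    """Eq. (13): variance at each row of phi_x given one data point phi_0."""
    c_xx = np.einsum("ij,jk,ik->i", phi_x, S_inv, phi_x)   # C(x, x)
    c_x0 = phi_x @ S_inv @ phi_0                           # C(x, x0)
    c_00 = phi_0 @ S_inv @ phi_0                           # C(x0, x0)
    return noise_var + c_xx - c_x0 ** 2 / (noise_var + c_00)

def full_variance(phi_x, phi_data, S, noise_var):
    """Eq. (10) with the Hessian of Eq. (8) built from all data points."""
    A = S + phi_data.T @ phi_data / noise_var
    return noise_var + np.einsum("ij,ij->i", phi_x, np.linalg.solve(A, phi_x.T).T)

centres = np.linspace(0.0, 1.0, 30)
width, noise_var = 0.07, 0.01           # noise level is an assumed value
S = np.eye(centres.size)                # assumed isotropic prior
S_inv = np.linalg.inv(S)

x_grid = np.linspace(0.0, 1.0, 201)
x_data = np.array([0.3, 0.5])
phi_grid = basis(x_grid, centres, width)
phi_data = basis(x_data, centres, width)

# One curve per data point, then the pointwise minimum (the envelope)
curves = np.stack([single_point_variance(phi_grid, p, S_inv, noise_var)
                   for p in phi_data])
envelope = curves.min(axis=0)
true_var = full_variance(phi_grid, phi_data, S, noise_var)
```

Note that the single-point curves need only the prior covariance function, whereas `full_variance` requires a linear solve with the Hessian; this is the computational saving the text refers to.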

4  Summary 

In  this  paper  we  have  explored  the  relationship  between  the  magnitude  of  the 
Bayesian  error  bars  and  the  distribution  of  data  in  input  space.  For  the  case  of  a 
single  isolated  data  point  we  have  shown  that  the  error  bar  is  pulled  down  close  to 
the  noise  level,  and  that  the  length  scale  over  which  this  effect  occurs  is  characterized 
by  the  prior  covariance  function.  From  this  result  we  have  derived  an  upper  bound 
on  the  error  bars,  expressed  in  terms  of  the  contributions  from  individual  data 
points.

Figure 1 A simple example of error bars for a one-dimensional input space and
a set of 30 equally spaced Gaussian basis functions with standard deviation 0.07.
There are two data points at $x = 0.3$ and $x = 0.5$ as shown by the crosses. The
solid curve at the top shows the variance $\sigma^2(x)$ due to the prior, the dashed curves
show the variance resulting from taking one data point at a time, and the lower
solid curve shows the variance due to the complete data set. The envelope of the
dashed curves constitutes an upper bound on the true error bars, while the noise
level (shown by the lower dashed curve) constitutes a lower bound.

In this paper an upper bound of the capacity or Vapnik-Chervonenkis dimension of structured
multi-layer feedforward neural networks with shared weight vectors is derived. It mainly depends
on the number of free network parameters, analogous to the case of feedforward networks with
independent weights. This means that weight sharing in a fixed neural architecture leads to a
significant reduction of the upper bound of the capacity.

1  Introduction 

Structured multi-layer feedforward neural networks are gaining importance
in speech and image processing applications. Their characteristic is that a priori
knowledge about the task to be performed is already built into their architecture
by use of nodes with shared weight vectors. Examples are time delay neural networks
[10] and networks for invariant pattern recognition [4, 5]. One problem in
the training of neural networks is the estimation of the number of training samples
needed to achieve good generalization. In [1] it is shown that for feedforward
architectures this number is correlated with the capacity or Vapnik-Chervonenkis
dimension of the architecture. So far an upper bound for the capacity has been derived
for two-layer feedforward architectures with independent weights: it depends
with  •  In  a)  number  w  of  connections  in  the  architecture  with  q  nodes 

and $a$ output elements. In this paper we focus on the calculation of upper bounds
for the capacity of structured multi-layer feedforward neural architectures. First we
give some definitions and introduce a new general terminology for the description
of structured neural networks. In section 3 we apply this terminology to structured
feedforward architectures, first with one layer and then with multiple layers. We show
that they can be transformed into equivalent conventional multi-layer feedforward
architectures. By extending known estimates for the capacity we obtain upper
bounds for the capacity of structured neural architectures which increase with the
number of independent network parameters. This means that weight sharing in a
fixed neural architecture leads to a significant reduction of the upper bound of the
capacity. The capacity mainly depends on the number of free parameters, analogous
to the case with independent weights. Finally, we comment on the results.

2  Definitions 

A layered feedforward network architecture $\mathcal{N}^r_{e,a}$ is a directed acyclic graph with a
sequence of $e$ input nodes, $r - 1$ ($r \in \mathbb{N}$) intermediate (hidden) layers of nodes, and
a final output layer with $a$ nodes. Every node is connected only to nodes in the
next layer. To every node $k$ with indegree $n \in \mathbb{N}$ a triplet $(w_k, s_k, f_k)$ is assigned,
consisting of a weight vector $w_k \in \mathbb{R}^n$, a threshold value $s_k \in \mathbb{R}$, and an activation
function $f_k : \mathbb{R} \to \{0, 1\}$. The activation function for all but the input nodes
is the hard limiter function, and without loss of generality we choose $s = 0$ for
the threshold values of all nodes. We define an architecture $\mathcal{N}^r_{e,a}$ with given triplets
$(w, s, f)$ for all nodes as a net $N^r_{e,a}$. With the net itself a function $F : \mathbb{R}^e \to \{0, 1\}^a$
is associated.

Let $S$ be a fixed $(m \times e)$-input-matrix for $\mathcal{N}^r_{e,a}$. All nets $N^r_{e,a}$ that map $S$ to the
same $(m \times a)$-output-matrix $T$ are grouped in a net class of $\mathcal{N}^r_{e,a}$ related to $S$. $\Delta(S)$
is the number of net classes of $\mathcal{N}^r_{e,a}$ related to $S$. The growth function $g(m)$ of an
architecture $\mathcal{N}^r_{e,a}$ with $m$ input vectors is the maximum number of net classes over
all $(m \times e)$-input matrices $S$. Now we consider the nodes of the architecture $\mathcal{N}^r_{e,a}$
within one layer (except the input layer) with the same indegree $d \in \mathbb{N}$. All nodes
$k$ whose weight vectors $w_k \in \mathbb{R}^d$ can have their components permuted through a
permutation $\pi_k : \mathbb{R}^d \to \mathbb{R}^d$ so that $\pi_k(w_k) = w$ for all $k$, for some vector $w \in \mathbb{R}^d$, are
elements of the same node class $K_w$. We call an architecture $\mathcal{N}^r_{e,a}$ structured if at
least one node class has more than one element. Then the architecture with $b$ node
classes $K_{w_i}$ ($i = 1, \ldots, b$) is denoted $\mathcal{N}^r_{e,a}(K_{w_1}, \ldots, K_{w_b})$.

The Vapnik-Chervonenkis dimension $d_{VC}$ [9] of a feedforward architecture is defined
by $d_{VC} := \sup \{ m \in \mathbb{N} \mid g(m) = 2^{m \cdot a} \}$. Let $Q := \{ m \in \mathbb{N} \mid g(m) / 2^{m \cdot a} \ge \tfrac{1}{2} \}$. Then
$c := \sup Q$ for $Q \ne \emptyset$, or $c := 0$ for $Q = \emptyset$, is an upper bound for the Vapnik-
Chervonenkis dimension and is also defined as capacity in [2, 7].

3  Upper  Bounds  for  the  Capacity 

This section shows how structured architectures can be transformed into
conventional architectures with independent weights. The upper bounds for the
capacity of these conventional architectures are then applied to the structured
architectures. A basic transformation needed in the following derivations is the
transformation of structured one-layer architectures $\mathcal{N}(K_{w_1}, \ldots, K_{w_b})$ with input
nodes of outdegree $> 1$ and input vectors $x_l$ into structured one-layer architectures
$\mathcal{N}'(K_{w_1}, \ldots, K_{w_b})$ with input nodes of outdegree 1 only and dependent input vectors
$x'_l$ ($l = 1, \ldots, m$). Every input node with outdegree $z > 1$ is replaced by
$z$ copies of that input node. The outgoing edges of the input node are assigned
to the copies in such a way that every copy has outdegree 1. The elements of
the input vectors are duplicated in the same way. By permuting the input nodes
and the corresponding components of the input vectors we get the architecture
$\mathcal{N}''(K_{w_1}, \ldots, K_{w_b})$ without any intersecting edges.

3.1  Structured  One-layer  Architectures 

I) First we focus on structured one-layer architectures $\mathcal{N}^1_{e,a}(K_w)$ with a set
$I := \{u_1, \ldots, u_e\}$ of $e$ input nodes and the output layer $K := \{k_1, \ldots, k_a\}$. Let
$K_w := K$ be the only node class. All nodes in $K_w = K$ have the same indegree
$d \in \mathbb{N}$.

Theorem 1 Let a structured one-layer architecture $\mathcal{N}^1_{e,a}(K_w)$ with only one node
class $K_w = K$ be given. Suppose $d \in \mathbb{N}$ is the indegree of all nodes in $K_w$. The
number of input nodes is $e \le a \cdot d$. For $m$ input vectors of length $e$ an upper bound
for the growth function $g(m)$ of the structured one-layer architecture $\mathcal{N}^1_{e,a}(K_w)$ is
given by

$$C(m \cdot a, d) := 2 \sum_{i=0}^{d-1} \binom{m \cdot a - 1}{i}$$

Proof At first we examine structured one-layer architectures with outdegree 1 for
every input node, equivalent to architectures with $e = a \cdot d$ input nodes.

By permuting the input nodes and the corresponding components of the input
vectors we get the architecture $\mathcal{N}''(K_w)$. Without loss of generality we consider
the permutation $\pi$ of the node class $K_w$ as the identity function. Thus we have
$w = w(k_1) = \ldots = w(k_a) \in \mathbb{R}^d$ for the $a$ weight vectors (cf. Figure 1 a)). For $m$
fixed input vectors $x_l := (x_l^1, \ldots, x_l^a) \in \mathbb{R}^{a \cdot d}$ ($x_l^z \in \mathbb{R}^d$, $l = 1, \ldots, m$, $z = 1, \ldots, a$)
let $S$ be an $(m \times a \cdot d)$-input matrix for $\mathcal{N}''(K_w)$:

$$S := \begin{pmatrix} x_1^1 & \cdots & x_1^a \\ \vdots & & \vdots \\ x_m^1 & \cdots & x_m^a \end{pmatrix}$$

A given weight vector $w_1 \in \mathbb{R}^d$ defines a function $F_1 : \mathbb{R}^{a \cdot d} \to \{0, 1\}^a$ or a net
$N_1$, respectively. Let $w_2$ be a weight vector that defines a function $F_2$ (a net $N_2$)
different to $F_1$ on the input matrix $S$. Thus these two nets are elements of different
net classes of $\mathcal{N}''(K_w)$ related to the input matrix $S$. Now we consider the one-layer
architecture $\mathcal{N}^1_{d,1}$ consisting of a single node with indegree $d$. By rearranging
the $m$ rows of the input matrix $S$ for $\mathcal{N}''(K_w)$ to one column of the $m \cdot a$ input
vectors $x_l^z \in \mathbb{R}^d$ ($l = 1, \ldots, m$, $z = 1, \ldots, a$) we derive the $(m \cdot a \times d)$-input matrix
$\tilde{S}$ for $\mathcal{N}^1_{d,1}$ (cf. Figure 1 b)):

$$\tilde{S} := \begin{pmatrix} x_1^1 \\ \vdots \\ x_1^a \\ \vdots \\ x_m^1 \\ \vdots \\ x_m^a \end{pmatrix} \tag{1}$$

Figure 1 a) Structured architecture $\mathcal{N}''(K_w)$ with an input vector $x_l$ ($l \in \{1, \ldots, m\}$).
b) Architecture $\mathcal{N}^1_{d,1}$ with the corresponding $a$ input vectors $x_l^1, \ldots, x_l^a$.

On $\mathcal{N}^1_{d,1}$ the weight vector $w_1$ ($w_2$) defines a function $\tilde{F}_1 : \mathbb{R}^d \to \{0, 1\}$ ($\tilde{F}_2 : \mathbb{R}^d \to \{0, 1\}$)
or a net $\tilde{N}_1$ ($\tilde{N}_2$), respectively. Because of $F_1(x_s) \ne F_2(x_s)$ for at least one
input vector $x_s$ ($s \in \{1, \ldots, m\}$) and definition (1), the nets $\tilde{N}_1$ and $\tilde{N}_2$ are elements
of different net classes of $\mathcal{N}^1_{d,1}$ related to the input matrix $\tilde{S}$. Summarizing we get:
if two nets of the architecture $\mathcal{N}''(K_w)$ are different related to any input matrix
$S$ we can define an input matrix $\tilde{S}$ for $\mathcal{N}^1_{d,1}$ by (1), so that the corresponding nets
are different, too. For the number of net classes this yields

$$\Delta(S) \le \Delta(\tilde{S}) . \tag{2}$$

With the results of [7] the growth function of the architecture $\mathcal{N}^1_{d,1}$ is given by
$C(m \cdot a, d)$. From (2) it also follows that this is an upper bound for the growth function
of the structured one-layer architecture $\mathcal{N}''(K_w)$ or $\mathcal{N}^1_{e,a}(K_w)$, respectively:
$g(m) \le C(m \cdot a, d)$. The inequality $g(m) \ge C(m \cdot a, d)$ can easily be verified in a
similar way, so we obtain $g(m) = C(m \cdot a, d)$ for the growth function of structured
one-layer architectures $\mathcal{N}^1_{e,a}(K_w)$ with outdegree 1 for every input node.

Now we consider structured one-layer architectures $\mathcal{N}^1_{e,a}(K_w)$ with outdegree $z > 1$
for some input nodes. These architectures can be transformed into structured one-
layer architectures $\mathcal{N}''(K_w)$ with $e = a \cdot d$ input nodes all with outdegree 1. But
the input vectors of the input matrix for the transformed architecture $\mathcal{N}''(K_w)$
cannot be chosen totally independently. Thus, $C(m \cdot a, d)$ is an upper bound for the
growth function of structured one-layer architectures $\mathcal{N}^1_{e,a}(K_w)$ with exactly one
node class $K_w = K$.

Remark 1 With [7] we find $\frac{2d}{a}$ as an upper bound for the capacity of structured
one-layer architectures $\mathcal{N}^1_{e,a}(K_w)$ with exactly one node class $K_w$.

II) Second we focus on structured one-layer architectures $\mathcal{N}^1_{e,a}(K_{w_1}, \ldots, K_{w_b})$ with
$b$ ($2 \le b \le a$) node classes $K_{w_1}, \ldots, K_{w_b}$. These classes form the set $K$ of the $a$
output nodes: $K = K_{w_1} \cup \ldots \cup K_{w_b}$.

Theorem 2 Assume a structured one-layer architecture $\mathcal{N}^1_{e,a}(K_{w_1}, \ldots, K_{w_b})$
with $e \le \sum_{i=1}^{b} \alpha_i \cdot d_i$ input nodes, $a = \sum_{i=1}^{b} \alpha_i$ output nodes, and $b \in \mathbb{N}$ ($2 \le b \le a$)
node classes $K_{w_i}$ ($i = 1, \ldots, b$). Let $\alpha_i := |K_{w_i}|$ be the sizes of the node classes
$K_{w_i}$, and $d_i$ the indegrees of the nodes in $K_{w_i}$ ($i = 1, \ldots, b$). For $m$ input vectors
the product

$$\prod_{i=1}^{b} C(m \cdot \alpha_i, d_i)$$

is an upper bound for the growth function of $\mathcal{N}^1_{e,a}(K_{w_1}, \ldots, K_{w_b})$.
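
To get a feel for how weight sharing tightens the product bound, it can be evaluated numerically. The sketch below assumes that $C(m, d)$ is Cover's counting function $2 \sum_{i=0}^{d-1} \binom{m-1}{i}$ for a single zero-threshold hard-limiter node; the function names and the example sizes are illustrative and do not come from the paper.

```python
from math import comb

def cover_count(m, d):
    """Assumed form of C(m, d): 2 * sum_{i=0}^{d-1} binom(m - 1, i)."""
    return 2 * sum(comb(m - 1, i) for i in range(d))

def growth_bound(m, class_sizes, indegrees):
    """Product bound of Theorem 2 for node classes with given sizes and indegrees."""
    bound = 1
    for size, d in zip(class_sizes, indegrees):
        bound *= cover_count(m * size, d)
    return bound

m, d, a = 5, 3, 4
shared = growth_bound(m, [a], [d])               # one class containing all 4 nodes
independent = growth_bound(m, [1] * a, [d] * a)  # 4 classes of size 1 (no sharing)
all_outputs = 2 ** (m * a)                       # number of possible output matrices
```

Under these assumptions, a single shared class gives a far smaller bound than four independent nodes with the same indegree, and both stay below the total number of output matrices.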

Proof At first we examine structured one-layer architectures $\mathcal{N}^1_{e,a}(K_{w_1}, \ldots, K_{w_b})$
with $e = \sum_{i=1}^{b} \alpha_i \cdot d_i$ input nodes (all with outdegree 1) and an $(m \times e)$-input matrix
$S$. The $a$ nodes in the output layer $K$ are permuted so that we get the ordered sequence
$K = \{K_{w_1}, \ldots, K_{w_b}\}$ of node classes $K_{w_i}$ ($i = 1, \ldots, b$). These permutations
generate $b$ structured one-layer sub-architectures $\mathcal{N}^1_{e_i,\alpha_i}(K_{w_i})$ with $e_i := \alpha_i \cdot d_i$
input nodes, $\alpha_i$ output nodes and $(m \times e_i)$-sub-input matrices $S_i$ ($i = 1, \ldots, b$).
With Theorem 1 we get $g_i(m) \le C(m \cdot \alpha_i, d_i)$ for the growth functions $g_i(m)$ of
these sub-architectures. For the determination of the growth function $g(m)$ of
$\mathcal{N}^1_{e,a}(K_{w_1}, \ldots, K_{w_b})$ the $b$ input matrices $S_i$ for the sub-architectures can be chosen
independently. Thus we get $g(m) = \prod_{i=1}^{b} g_i(m) \le \prod_{i=1}^{b} C(m \cdot \alpha_i, d_i)$. The result
for structured one-layer architectures $\mathcal{N}^1_{e,a}(K_{w_1}, \ldots, K_{w_b})$ with outdegree $> 1$ for
some input nodes, equivalent to $e < \sum_{i=1}^{b} \alpha_i \cdot d_i$ input nodes, follows in a similar
way to Theorem 1. □

Theorem 3 For the capacity $c$ of a structured one-layer architecture
$\mathcal{N}^1_{e,a}(K_{w_1}, \ldots, K_{w_b})$ with $b \in \mathbb{N}$ ($2 \le b \le a$) node classes $K_{w_i}$ ($i = 1, \ldots, b$),
maximum indegree $d \ge 2$ for all nodes, maximum size $\alpha := \max_i |K_{w_i}|$ of the node
classes,  and  t  :=  ~  >  2  we  get 


Proof For the growth function $g(m)$ of the architecture $\mathcal{N}^1_{e,a}(K_{w_1}, \ldots, K_{w_b})$ we
get with the above definitions and Theorem 2:

$$g(m) \le \prod_{i=1}^{b} C(m \cdot \alpha_i, d_i) \le \prod_{i=1}^{b} C(m \cdot \alpha, d) = C(m \cdot \alpha, d)^b .$$

This  yields  an  upper  bound  for  the  capacity:  c  <  sup  |  m  €  IN 
With  some  estimations  and  const  :=  it  follows: 


C(ma,d) 


I  •  a 

m  <  const  ■  ■  ln(/)  . 

a 

For  details  and  further  information  see  [8].  □ 

3.2  Structured  Multi-layer  Architectures 

Consider a structured $r$-layer architecture with $e$ input nodes, $a_j$ nodes in the
hidden layers ($j = 1, \ldots, r - 1$) and $a$ nodes in the output layer $K$. Let
the hidden layers be the disjoint union of $b_j \le a_j$ node classes each,
and the output layer the disjoint union of the node classes $K_{w_i}$ ($i = 1, \ldots, b$).
The number of node classes is $\sum_{j=1}^{r-1} b_j + b =: \beta$. The structured architecture is
denoted by $\mathcal{N}^r_{e,a}(K_{w_1}, \ldots, K_{w_b})$. A structured $r$-layer feedforward architecture
can be regarded as a combination of $r$ structured one-layer
feedforward architectures since the output matrices of each layer are the input
matrices for the following layer. Thus, we get an upper bound for the growth function
$g(m)$ of $\mathcal{N}^r_{e,a}(K_{w_1}, \ldots, K_{w_b})$ by multiplying the growth functions of the $r$
structured one-layer architectures (refer to Theorem 2):

$$g(m) \le \prod_{j=1}^{r-1} \left( \prod_{i=1}^{b_j} C(m \cdot \alpha_{ji}, d_{ji}) \right) \cdot \prod_{i=1}^{b} C(m \cdot \alpha_i, d_i) . \tag{3}$$

Here $\alpha_{ji}$ and $d_{ji}$ denote the sizes and indegrees of the node classes in hidden layer $j$.
With the maximum size $\alpha$ of the $\beta$ node classes, and the maximum indegree $d$ of all
nodes of the architecture $\mathcal{N}^r_{e,a}(K_{w_1}, \ldots, K_{w_b})$, $C(m \cdot \alpha, d)^\beta$ is an upper bound for (3).

Theorem  4  Let  •^fe,a{^w\  ^  •  j  ^^Wb)  be  a  structured  r-layer  feedforward  architec¬ 
ture  with  P  >  2  node  classes  .  .,Kwb,  d  >  2  the  maximum  indegree  of  all 

nodes,  a  the  maximum  size  of  all  P  node  classes,  andt:=  ^  >  2.  For  the  capacity 
of $\mathcal{N}^r_{e,a}(K_{w_1}, \ldots, K_{w_b})$ we get

Proof Analogous to the proof of Theorem 3.

An architecture $\mathcal{N}^r_{e,a}(K_{w_1}, \ldots, K_{w_b})$ with $\sum_{j=1}^{r-1} b_j + b = \sum_{j=1}^{r-1} a_j + a$ node classes
is equivalent to an architecture in which every node class has size 1. So the
above upper bounds for the capacity also hold for conventional $r$-layer feedforward
architectures.

4  Conclusion 

By transforming architectures with shared weight vectors into equivalent conventional
feedforward architectures and extending the definitions of the growth
function and the capacity to multi-layer feedforward architectures, we obtain estimates
for the upper bounds of the capacity of structured multi-layer architectures.
These  upper  bounds  depend  with  0{^  •  Inf)  on  the  number  p  of  free  parameters 
in  the  structured  neural  architecture  with  maximum  size  a  of  the  /?  node  classes, 
f ^  >  2,  and  a  nodes  in  the  output  layer.  So  weight  sharing  in  a  fixed  neural 
architecture leads to a reduction of the upper bound of the capacity. The amount of
the reduction increases with the extent of the weight sharing. With $\alpha = 1$ the upper
bounds also hold for conventional feedforward networks with independent weights.
It is known that the generalization ability of a feedforward neural architecture
improves within certain limits with a reduction of the capacity for a fixed number of
training samples. As a consequence of our results a better generalization ability can
be derived for structured neural architectures compared to the same unstructured
ones. This is a theoretical justification for the generalization ability of structured
neural architectures observed in experiments [5]. Further investigations will focus on
an improvement of the upper bounds, on the determination of capacity bounds
for special structured architectures, and on the derivation of capacity bounds for
structured architectures of nodes with continuous transfer functions [3, 6].

We present an analytic solution to the problem of on-line gradient-descent learning for two-layer
neural networks with an arbitrary number of hidden units in both teacher and student networks.
The technique, demonstrated here for the case of adaptive input-to-hidden weights, becomes exact
as the dimensionality of the input space increases.

Layered neural networks are of interest for their ability to implement input-output
maps [1]. Classification and regression tasks formulated as a map from an $N$-
dimensional input space $\xi$ onto a scalar $\zeta$ are realized through a map $\zeta = f_J(\xi)$,
which can be modified through changes in the internal parameters $\{J\}$ specifying
the strength of the interneuron couplings. Learning refers to the modification of
these couplings so as to bring the map $f_J$ implemented by the network as close
as possible to a desired map $f$. Information about the desired map is provided
through independent examples $(\xi^\mu, \zeta^\mu)$, with $\zeta^\mu = f(\xi^\mu)$ for all $\mu$. A recently
introduced approach investigates on-line learning [2]. In this scenario the couplings
are adjusted to minimize the error after the presentation of each example. The
resulting changes in $\{J\}$ are described as a dynamical evolution, with the number of
examples playing the role of time. The average that accounts for the disorder
introduced by the independent random selection of an example at each time step can
be performed directly. The result is expressed in the form of dynamical equations
for order parameters which describe correlations among the various nodes in the
trained network as well as their degree of specialization towards the implementation
of the desired task. Here we obtain analytic equations of motion for the order
parameters in a general two-layer scenario: a student network composed of $N$ input
units, $K$ hidden units, and a single linear output unit is trained to perform a task
defined through a teacher network of similar architecture except that its number
$M$ of hidden units is not necessarily equal to $K$. Two-layer networks with an
arbitrary number of hidden units have been shown to be universal approximators [1]
for $N$-to-one dimensional maps. Our results thus describe the learning of tasks of
arbitrary complexity (general $M$). The complexity of the student network is also
arbitrary (general $K$, independent of $M$), providing a tool to investigate realizable
($K = M$), over-realizable ($K > M$), and unrealizable ($K < M$) learning scenarios.
In this paper we limit our discussion to the case of the soft-committee machine
[2], in which all the hidden units are connected to the output unit with positive
couplings of unit strength, and only the input-to-hidden couplings are adaptive.
Consider the student network: hidden unit $i$ receives information from input unit
$r$ through the weight $J_{ir}$, and its activation under presentation of an input pattern
$\xi$ is $x_i = J_i \cdot \xi$, with $J_i = (J_{i1}, \ldots, J_{iN})$ defined as the vector of
incoming weights onto the $i$-th hidden unit. The output of the student network is
$\sigma(J, \xi) = \sum_{i=1}^{K} g(J_i \cdot \xi)$, where $g$ is the activation function of the hidden units,
taken here to be the error function $g(x) = \mathrm{erf}(x/\sqrt{2})$, and $J \equiv \{J_i\}_{1 \le i \le K}$ is the set
of input-to-hidden adaptive weights. Training examples are of the form $(\xi^\mu, \zeta^\mu)$. The
components of the independently drawn input vectors $\xi^\mu$ are uncorrelated random
variables with zero mean and unit variance. The corresponding output $\zeta^\mu$ is given
by a deterministic teacher whose internal structure is that of a network similar to
the student except for a possible difference in the number $M$ of hidden units. Hidden
unit $n$ in the teacher network receives input information through the weight
vector $B_n = (B_{n1}, \ldots, B_{nN})$, and its activation under presentation of the input
pattern $\xi^\mu$ is $y_n^\mu = B_n \cdot \xi^\mu$. The corresponding output is $\zeta^\mu = \sum_{n=1}^{M} g(B_n \cdot \xi^\mu)$.
We will use indices $i, j, k, l, \ldots$ to refer to units in the student network, and
$n, m, \ldots$ for units in the teacher network. The error made by a student with weights $J$ on a
given input $\xi$ is given by the quadratic deviation

$$\epsilon(J, \xi) \equiv \frac{1}{2} \left[ \sigma(J, \xi) - \zeta \right]^2 = \frac{1}{2} \left[ \sum_{i=1}^{K} g(x_i) - \sum_{n=1}^{M} g(y_n) \right]^2 \tag{1}$$

Performance on a typical input defines the generalization error $\epsilon_g(J) \equiv
\langle \epsilon(J, \xi) \rangle_{\{\xi\}}$ through an average over all possible input vectors $\xi$, to be
performed implicitly through averages over the activations $x = (x_1, \ldots, x_K)$ and
$y = (y_1, \ldots, y_M)$. Note that both $\langle x_i \rangle = \langle y_n \rangle = 0$, while the components of the
covariance matrix $C$ are given by overlaps among the weight vectors associated with
the various hidden units: $\langle x_i x_k \rangle = J_i \cdot J_k \equiv Q_{ik}$, $\langle x_i y_n \rangle = J_i \cdot B_n \equiv R_{in}$,
and $\langle y_n y_m \rangle = B_n \cdot B_m \equiv T_{nm}$. The averages over $x$ and $y$ are performed using
a joint probability distribution given by the multivariate Gaussian:

$$P(x, y) = \frac{1}{\sqrt{(2\pi)^{K+M} |C|}} \exp\left\{ -\frac{1}{2} (x, y)^T C^{-1} (x, y) \right\}, \quad \text{with} \quad C = \begin{pmatrix} Q & R \\ R^T & T \end{pmatrix} \tag{2}$$

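The averaging yields an expression for the generalization error in terms of the order
parameters $Q_{ik}$, $R_{in}$, and $T_{nm}$. For $g(x) = \mathrm{erf}(x/\sqrt{2})$ the result is:

$$\epsilon_g(J) = \frac{1}{\pi} \left\{ \sum_{ik} \arcsin \frac{Q_{ik}}{\sqrt{1 + Q_{ii}} \sqrt{1 + Q_{kk}}} + \sum_{nm} \arcsin \frac{T_{nm}}{\sqrt{1 + T_{nn}} \sqrt{1 + T_{mm}}} - 2 \sum_{in} \arcsin \frac{R_{in}}{\sqrt{1 + Q_{ii}} \sqrt{1 + T_{nn}}} \right\} \tag{3}$$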

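The generalization error for $g(x) = \mathrm{erf}(x/\sqrt{2})$ depends only on the overlap matrices, so it is cheap to evaluate. The following is a minimal sketch, assuming the arcsine form $\epsilon_g = \frac{1}{\pi}\{\sum_{ik} \arcsin[Q_{ik}/\sqrt{(1+Q_{ii})(1+Q_{kk})}] + \sum_{nm} \arcsin[T_{nm}/\sqrt{(1+T_{nn})(1+T_{mm})}] - 2\sum_{in} \arcsin[R_{in}/\sqrt{(1+Q_{ii})(1+T_{nn})}]\}$; the function name `generalization_error` is an illustrative choice.

```python
import numpy as np

def generalization_error(Q, R, T):
    """Generalization error of a soft committee machine from its overlaps.

    Q: K x K student-student overlaps, R: K x M student-teacher overlaps,
    T: M x M teacher-teacher overlaps.
    """
    Q, R, T = (np.asarray(mat, dtype=float) for mat in (Q, R, T))
    q = np.sqrt(1.0 + np.diag(Q))   # sqrt(1 + Q_ii) for each student unit
    t = np.sqrt(1.0 + np.diag(T))   # sqrt(1 + T_nn) for each teacher unit
    student = np.arcsin(Q / np.outer(q, q)).sum()
    teacher = np.arcsin(T / np.outer(t, t)).sum()
    cross = np.arcsin(R / np.outer(q, t)).sum()
    return (student + teacher - 2.0 * cross) / np.pi

# A student identical to an isotropic two-unit teacher makes no error
perfect = generalization_error(np.eye(2), np.eye(2), np.eye(2))
# A single student unit uncorrelated with a single teacher unit
uncorrelated = generalization_error([[1.0]], [[0.0]], [[1.0]])
```

The two examples are sanity checks: the error vanishes when $Q = R = T$, and equals $\frac{2}{\pi} \arcsin \frac{1}{2} = \frac{1}{3}$ for one unit-norm student unit orthogonal to one unit-norm teacher unit.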
The parameters $T_{nm}$ are characteristic of the task to be learned and remain fixed,
while the overlaps $Q_{ik}$ and $R_{in}$ are determined by the student weights $J$ and evolve
during training. A gradient descent rule for the update of the student weights results
in $J_i^{\mu+1} = J_i^\mu + \frac{\eta}{N} \delta_i^\mu \xi^\mu$, where the learning rate $\eta$ has been scaled with the input
size $N$, and $\delta_i^\mu \equiv g'(x_i^\mu) \left[ \sum_{n=1}^{M} g(y_n^\mu) - \sum_{j=1}^{K} g(x_j^\mu) \right]$ is defined in terms of both
the activation function $g$ and its derivative $g'$. The time evolution of the overlaps
$R_{in}$ and $Q_{ik}$ can be explicitly written in terms of similar difference equations. The
dependence on the current input $\xi$ is only through the activations $x$ and $y$, and the
corresponding averages can be performed using the joint probability distribution
(2). In the thermodynamic limit $N \to \infty$ the normalized example number $\alpha = \mu/N$
can be interpreted as a continuous time variable, leading to the equations of motion:

$$\frac{dR_{in}}{d\alpha} = \eta \left\{ \sum_{m} I_3(i, n, m) - \sum_{j} I_3(i, n, j) \right\} ,$$

$$\frac{dQ_{ik}}{d\alpha} = \eta \left\{ \sum_{m} I_3(i, k, m) - \sum_{j} I_3(i, k, j) \right\} + \eta \left\{ \sum_{m} I_3(k, i, m) - \sum_{j} I_3(k, i, j) \right\} + \eta^2 \left\{ \sum_{n,m} I_4(i, k, n, m) - 2 \sum_{j,n} I_4(i, k, j, n) + \sum_{j,l} I_4(i, k, j, l) \right\} . \tag{4}$$

The two multivariate Gaussian integrals $I_3 \equiv \langle g'(u)\, v\, g(w) \rangle$ and $I_4 \equiv \langle g'(u)\, g'(v)\, g(w)\, g(z) \rangle$ represent averages over the probability distribution (2). The averages can be performed analytically for the choice $g(x) = \mathrm{erf}(x/\sqrt{2})$. Arguments assigned to $I_3$ and $I_4$ are to be interpreted following our convention to distinguish student from teacher activations. For example, $I_3(i,n,j) \equiv \langle g'(x_i)\, y_n\, g(x_j) \rangle$, and the average is performed using the three-dimensional covariance matrix $C_3$ which results from projecting the full covariance matrix $C$ of Eq. (2) onto the relevant subspace. For $I_3(i,n,j)$ the corresponding matrix is:

$$C_3 = \begin{bmatrix} Q_{ii} & R_{in} & Q_{ij} \\ R_{in} & T_{nn} & R_{jn} \\ Q_{ij} & R_{jn} & Q_{jj} \end{bmatrix}.$$

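$I_3$ is given in terms of the components of the $C_3$ covariance matrix by

$$I_3 = \frac{2}{\pi}\,\frac{1}{\sqrt{\Lambda_3}}\,\frac{C_{23}(1 + C_{11}) - C_{12} C_{13}}{1 + C_{11}}, \qquad (5)$$

with $\Lambda_3 = (1 + C_{11})(1 + C_{33}) - C_{13}^2$. The expression for $I_4$ in terms of the components of the corresponding $C_4$ covariance matrix is

$$I_4 = \frac{4}{\pi^2}\,\frac{1}{\sqrt{\Lambda_4}}\,\arcsin\left(\frac{\Lambda_0}{\sqrt{\Lambda_1}\,\sqrt{\Lambda_2}}\right), \qquad (6)$$

where $\Lambda_4 = (1 + C_{11})(1 + C_{22}) - C_{12}^2$, and

$$\Lambda_0 = \Lambda_4 C_{34} - C_{23} C_{24}(1 + C_{11}) - C_{13} C_{14}(1 + C_{22}) + C_{12} C_{13} C_{24} + C_{12} C_{14} C_{23},$$

$$\Lambda_1 = \Lambda_4 (1 + C_{33}) - C_{23}^2 (1 + C_{11}) - C_{13}^2 (1 + C_{22}) + 2 C_{12} C_{13} C_{23},$$

$$\Lambda_2 = \Lambda_4 (1 + C_{44}) - C_{24}^2 (1 + C_{11}) - C_{14}^2 (1 + C_{22}) + 2 C_{12} C_{14} C_{24}.$$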
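The closed forms for $I_3$ and $I_4$ translate directly into code. The following is a minimal sketch, with function names chosen here for illustration and NumPy assumed. As a sanity check that is not taken from the paper, it uses the special case in which $u$ (and $v$) are uncorrelated with the remaining variables, where the averages factorize into $\langle g'(u) \rangle = \sqrt{2/\pi}/\sqrt{1 + C_{11}}$ and the standard erf correlation $\langle g(w) g(z) \rangle = (2/\pi) \arcsin\left(C_{34}/\sqrt{(1+C_{33})(1+C_{44})}\right)$.

```python
import numpy as np

def I3(C):
    """<g'(u) v g(w)> for g(x) = erf(x / sqrt(2)); C is the 3x3 covariance of (u, v, w)."""
    C = np.asarray(C, dtype=float)
    c11, c12, c13 = C[0, 0], C[0, 1], C[0, 2]
    c23, c33 = C[1, 2], C[2, 2]
    lam3 = (1 + c11) * (1 + c33) - c13**2
    return (2 / np.pi) * (c23 * (1 + c11) - c12 * c13) / (np.sqrt(lam3) * (1 + c11))

def I4(C):
    """<g'(u) g'(v) g(w) g(z)>; C is the 4x4 covariance of (u, v, w, z)."""
    C = np.asarray(C, dtype=float)
    c11, c12, c13, c14 = C[0, 0], C[0, 1], C[0, 2], C[0, 3]
    c22, c23, c24 = C[1, 1], C[1, 2], C[1, 3]
    c33, c34, c44 = C[2, 2], C[2, 3], C[3, 3]
    lam4 = (1 + c11) * (1 + c22) - c12**2
    lam0 = (lam4 * c34 - c23 * c24 * (1 + c11) - c13 * c14 * (1 + c22)
            + c12 * c13 * c24 + c12 * c14 * c23)
    lam1 = lam4 * (1 + c33) - c23**2 * (1 + c11) - c13**2 * (1 + c22) + 2 * c12 * c13 * c23
    lam2 = lam4 * (1 + c44) - c24**2 * (1 + c11) - c14**2 * (1 + c22) + 2 * c12 * c14 * c24
    return (4 / np.pi**2) * np.arcsin(lam0 / np.sqrt(lam1 * lam2)) / np.sqrt(lam4)

# Factorized test cases (illustrative values): u, v uncorrelated with the rest.
C3 = np.array([[0.5, 0.0, 0.0],
               [0.0, 1.0, 0.3],
               [0.0, 0.3, 0.8]])
i3_value = I3(C3)
i3_expected = (2 / np.pi) * 0.3 / np.sqrt(1.5 * 1.8)

C4 = np.array([[0.5, 0.0, 0.0, 0.0],
               [0.0, 0.7, 0.0, 0.0],
               [0.0, 0.0, 1.0, 0.4],
               [0.0, 0.0, 0.4, 0.9]])
i4_value = I4(C4)
i4_expected = (4 / np.pi**2) * np.arcsin(0.4 / np.sqrt(2.0 * 1.9)) / np.sqrt(1.5 * 1.7)
print(i3_value, i3_expected, i4_value, i4_expected)
```

To use these inside the equations of motion one would build the relevant $3 \times 3$ or $4 \times 4$ sub-block of $C$ for each index combination, following the student/teacher index convention of the text.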

These dynamical equations provide a novel tool for analyzing the learning process for a general soft-committee machine with an arbitrary number $K$ of hidden units, trained to perform a task defined by a soft-committee teacher with $M$ hidden units. This set of coupled first-order differential equations can be easily solved numerically, even for large values of $K$ and $M$, providing valuable insight into the process of learning in multilayer networks, and allowing for the calculation of the time evolution of the generalization error [3]. In what follows we focus on learning a realizable task ($K = M$) defined through uncorrelated teacher vectors of unit length ($T_{nm} = \delta_{nm}$). The time evolution of the overlaps $R_{in}$ and $Q_{ik}$ follows from integrating the equations of motion (3) from initial conditions determined by a random initialization of the student vectors $\{J_i\}_{1 \le i \le K}$. Random initial norms $Q_{ii}$ for the student vectors are taken here from a uniform distribution in the $[0, 0.5]$ interval. Overlaps $Q_{ik}$ between independently chosen student vectors $J_i$ and $J_k$, or $R_{in}$ between $J_i$ and an unknown teacher vector $B_n$, are small numbers, of order $1/\sqrt{N}$ for $N \gg K$, and taken here from a uniform distribution in the $[0, 10^{-12}]$ interval. We show in Fig. 1a-c the resulting evolution of the overlaps and generalization error for $K = 3$ and $\eta = 0.1$. This example illustrates the successive regimes of the learning process. The system quickly evolves into a symmetric subspace controlled by an unstable suboptimal solution which exhibits no differentiation among the various student hidden units. Trapping in the symmetric subspace prevents the specialization needed to achieve the optimal solution, and the generalization error remains finite, as shown by the plateau in Fig. 1c. The symmetric solution is unstable, and the perturbation introduced through the random initialization of the overlaps $R_{in}$ eventually takes over: the student units become specialized and the matrix $R$ of student-teacher overlaps tends towards the matrix $T$, except for a permutational symmetry associated with the arbitrary labeling of the student hidden units. The generalization error plateau is followed by a monotonic decrease towards zero once the specialization begins and the system evolves towards the optimal solution.
Curves for the time evolution of the generalization error for different values of $\eta$, shown in Fig. 1d for $K = 3$, identify trapping in the symmetric subspace as a small $\eta$ phenomenon. We therefore consider the equations of motion (3) in the small $\eta$ regime. The term proportional to $\eta^2$ is neglected and the resulting truncated equations of motion are used to investigate a phase characterized by students of similar norms: $Q_{ii} = Q$ for all $1 \le i \le K$, similar correlations among themselves: $Q_{ik} = C$ for all $i \ne k$, and similar correlations with the teacher vectors: $R_{in} = R$ for all $1 \le i, n \le K$. The resulting dynamical equations exhibit a fixed point solution at $Q^* = C^* = 1/(2K - 1)$ and $R^* = \sqrt{Q^*/K} = 1/\sqrt{K(2K - 1)}$. The corresponding generalization error is given by $\epsilon^* = (K/\pi)\left\{\pi/6 - K \arcsin\left((2K)^{-1}\right)\right\}$. A simple geometrical picture explains the relation $Q^* = C^* = K(R^*)^2$ at the symmetric fixed point. The learning process confines the student vectors $\{J_i\}$ to the subspace $S_B$ spanned by the set of teacher vectors $\{B_n\}$. For $T_{nm} = \delta_{nm}$ the teacher vectors form an orthonormal set: $B_n = e_n$, with $e_n \cdot e_m = \delta_{nm}$ for $1 \le n, m \le K$, and provide an expansion for the weight vectors of the trained student: $J_i^* = \sum_n R_{in}\, e_n$. The student-teacher overlaps $R_{in}$ are independent of $i$ in the symmetric phase and independent of $n$ for an isotropic teacher: $R_{in} = R^*$ for all $1 \le i, n \le K$. The expansion $J_i^* = R^* \sum_n e_n$ for all $i$ results in $Q^* = C^* = K(R^*)^2$. The length of the symmetric plateau is controlled by the degree of asymmetry in the initial conditions [2] and by the learning rate $\eta$. The small $\eta$ analysis predicts trapping times inversely proportional to $\eta$, in quantitative agreement with the shrinking plateau of Fig. 1d. The increase in the height of the plateau with decreasing $\eta$ is a second order effect [3], as the truncated equations of motion predict a unique value of $\epsilon^* = 0.0203$ at $K = 3$. Escape from the symmetric subspace signals the onset of hidden unit specialization. As shown in Fig. 1b, the process is driven by a breaking of the uniformity of the student-teacher correlations [3]: each student node becomes increasingly specialized to a specific teacher node, while its overlap with the remaining
Figure 1 The overlaps and the generalization error as a function of $\alpha$ for a three-node student learning an isotropic teacher ($T_{nm} = \delta_{nm}$). Results for $\eta = 0.1$ are shown for (a) student-student overlaps $Q_{ik}$, (b) student-teacher overlaps $R_{in}$, and (c) the generalization error. The generalization error for different values of the learning rate $\eta$ is shown in (d).

teacher nodes decreases and eventually decays to zero. We thus distinguish between a growing overlap $R$ between a given student node and the teacher node it begins to imitate, and decaying secondary overlaps $S$ between the same student node and the remaining teacher nodes. Further specialization involves the decay to zero of the student-student correlations $C$ and the growth of the norms $Q$ of the student vectors. The student nodes can be relabeled so as to bring the matrix of student-teacher overlaps to the form $R_{in} = R\,\delta_{in} + S(1 - \delta_{in})$; the matrix of student-student overlaps is of the form $Q_{ik} = Q\,\delta_{ik} + C(1 - \delta_{ik})$. The subsequent evolution of the system converges to an optimal solution with perfect generalization, characterized by a fixed point at $(R^*)^2 = Q^* = 1$ and $S^* = C^* = 0$, with $\epsilon^* = 0$. Linearization of the full equations of motion around the asymptotic fixed point results in four eigenvalues, of which only two control convergence. An initially slow mode is characterized by a negative eigenvalue that decreases monotonically with $\eta$, while an initially faster mode is characterized by an eigenvalue that eventually increases and becomes positive at $\eta_{\max} = \pi/(\sqrt{3}\,K)$, to first order in $1/K$. Exponential convergence of $R$, $S$, $C$, and $Q$ to their optimal values is guaranteed for all learning rates in the range $(0, \eta_{\max})$; in this regime the generalization error decays exponentially to $\epsilon^* = 0$, with a rate controlled by the slowest decay mode.
Freeman's investigations on the olfactory bulb of the rabbit showed that its dynamics was chaotic, and that recognition of a learned pattern is linked to a dimension reduction of the dynamics onto a much simpler attractor (near limit cycle). We address here the question whether this behaviour is specific to this particular architecture or whether the observed behaviour is an important property of chaotic neural networks using a Hebb-like learning rule. In this paper, we use a mean-field theoretical statement to determine the spontaneous dynamics of an asymmetric recurrent neural network. In particular we determine the range of random weight matrices for which the network is chaotic. We are able to explain the various changes observed in the dynamical regime when sending static random patterns. We propose a Hebb-like learning rule to store a pattern as a limit cycle or strange attractor. We numerically show the dynamics reduction of a finite-size chaotic network during learning and recognition of a pattern. Though associative learning is actually performed, the low storage capacity of the system leads to the consideration of more realistic architectures.

1  Introduction 

Most studies on recurrent neural networks assume sufficient conditions of
convergence.  Relaxation  to  a  stable  network  state  is  simply  interpreted  as  a  stored 
pattern.  Models  with  symmetric  synaptic  connections  have  relaxation  dynamics. 
Networks  with  asymmetric  synaptic  connections  lose  this  convergence  property  and 
can  have  more  complex  dynamics.  However,  as  pointed  out  by  Hirsch,  [8],  it  might 
be  very  interesting,  from  an  engineering  point  of  view,  to  investigate  non  convergent 
networks  because  their  dynamical  possibilities  are  much  richer  for  a  given  number 
of  units. 

Moreover,  the  real  brain  is  a  highly  dynamic  system.  Recent  neurophysiological 
findings  have  focused  attention  on  the  rich  temporal  structures  (oscillations)  of 
neuronal processes [7, 6] which might play an important role in information processing. Chaotic behavior has been found in the nervous system [2] and might
be  implicated  in  cognitive  processes  [9].  Freeman’s  paradigm  [9]  is  that  the  basic 
dynamics  of  a  neural  system  is  chaotic  and  that  a  particular  pattern  is  stored  as  an 
attractor  of  lower  dimension  than  the  initial  chaotic  one.  The  learning  procedure 
thus  leads  to  the  creation  of  such  an  attractor.  During  the  recognition  process,  first, 
the  network  explores  a  large  region  of  its  phase  space  through  a  chaotic  dynamics. 
When the stimulus is presented, the dynamics is reduced and the system follows the lower dimensional attractor which has been created during the learning process. The question arises whether this paradigm, which has been simulated in [11] using an artificial neural network, is due to a very specific architecture or whether it is a general phenomenon for recurrent networks.

The first step in addressing this problem was to determine the conditions for the existence of chaotic dynamics among the various architectures of recurrent neural networks. A major theoretical advance in that direction was achieved by Sompolinsky et al. [10]. They established strong theoretical results concerning the occurrence of chaos for fully connected asymmetric random recurrent networks in the thermodynamic limit by using dynamical mean field theory. In their model, neurons are formal neurons with activation state in [-1,1], with a symmetric transfer function and no threshold. The authors show that the system exhibits chaotic dynamics. These results were extended by us in [5] to the case of diluted networks with discrete time dynamics. One can ask whether such results remain valid in a more general class of neural networks with no reversal symmetry, i.e. with activation state in [0,1] and thresholds. The presence of thresholds is biologically interesting. Moreover, it allows one to study the behaviour of the network when submitted to an external input. In this paper, we describe this model and report the main results about spontaneous dynamics in section 2. In section 3, we define a Hebbian learning rule. We study the reduction of the dynamics during the learning process and the recognition of a learned stimulus. We then discuss the results and conclude in section 4.

2  Spontaneous  Dynamics  of  Random  Recurrent  Networks  with 
Thresholds 

The neuron states are continuous variables $x_i$, $i = 1, \ldots, N$. The network dynamics is given by:

$$x_i(t+1) = f\left(\sum_{j=1}^{N} J_{ij}\, x_j(t) - \theta_i\right), \qquad i = 1, \ldots, N, \qquad (1)$$

where $J_{ij}$ is the synaptic weight between the neuron $j$ and the neuron $i$. The $J_{ij}$'s are independent identically distributed random variables with expectation $E(J_{ij}) = \bar{J}/N$ and variance $\mathrm{Var}(J_{ij}) = J^2/N$. The thresholds $\theta_i$ are independent identically distributed gaussian random variables of expectation $E(\theta_i) = \bar{\theta}$ and variance $\mathrm{Var}(\theta_i) = \sigma_\theta^2$. Our numerical studies are made with the sigmoidal function $f(x) = \frac{1 + \tanh(gx)}{2}$. The neuron states $x_i(t)$ belong to $[0,1]$. Thus $x_i(t)$ may be seen as the mean firing rate of a neuron.
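A minimal simulation sketch of this discrete-time dynamics follows. It rests on several assumptions that are not stated in the text: the weights are drawn from a Gaussian distribution (the text only requires i.i.d. weights with the given mean and variance), the transfer function is taken to be $f(x) = (1 + \tanh(gx))/2$, and the parameter values and function names are purely illustrative.

```python
import numpy as np

def transfer(u, g):
    """Sigmoid with values in [0, 1] and gain g (assumed form of f)."""
    return 0.5 * (1.0 + np.tanh(g * u))

def simulate(N=200, g=6.0, J_bar=0.0, J=1.0, theta_bar=0.0, sigma_theta=0.1,
             steps=300, seed=0):
    """Iterate x(t+1) = f(W x(t) - theta) and return the trajectory (steps, N)."""
    rng = np.random.default_rng(seed)
    # Weights with mean J_bar / N and variance J^2 / N (Gaussian is an assumption).
    W = rng.normal(J_bar / N, J / np.sqrt(N), size=(N, N))
    theta = rng.normal(theta_bar, sigma_theta, size=N)
    x = rng.uniform(0.0, 1.0, size=N)
    traj = np.empty((steps, N))
    for t in range(steps):
        x = transfer(W @ x - theta, g)
        traj[t] = x
    return traj

traj = simulate()
# Empirical first and second moments of the activity at the last time step.
m_N, q_N = traj[-1].mean(), (traj[-1] ** 2).mean()
print(m_N, q_N)
```

Varying `g`, `J` and the threshold statistics in such a simulation is one way to explore the dynamical regimes discussed in this section.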

This kind of system is known to present the “propagation of chaos” property in the thermodynamic limit. This approach was initiated for neural networks by Amari [1]. Though the denomination of “chaos” for this basic property of vanishing finite-size correlations is quite confusing and has nothing to do with the deterministic chaos which will be considered afterwards, we shall keep it. Namely, the intra-correlations between finite sets of neuron activation states and between neurons and weights vanish, and each neuronal activation state process converges in law in the thermodynamic limit towards independent gaussian processes. This statement allows us to derive mean field equations governing the limits for $N \to \infty$ of

$$m_N(t) = \frac{1}{N}\sum_{i=1}^{N} x_i(t), \qquad q_N(t) = \frac{1}{N}\sum_{i=1}^{N} x_i(t)^2 .$$

These mean field equations are

$$m(t+1) = \int_{-\infty}^{\infty} \frac{dh}{\sqrt{2\pi}}\, e^{-h^2/2}\, f\left[h\sqrt{J^2 q(t) + \sigma_\theta^2} + \bar{J}\, m(t) - \bar{\theta}\right], \qquad (2)$$

$$q(t+1) = \int_{-\infty}^{\infty} \frac{dh}{\sqrt{2\pi}}\, e^{-h^2/2}\, f^2\left[h\sqrt{J^2 q(t) + \sigma_\theta^2} + \bar{J}\, m(t) - \bar{\theta}\right]. \qquad (3)$$

We shall now restrict ourselves for numerical studies to the case where $\bar{J} = 0$. The second mean-field equation is self-consistent and we are able to determine the critical surface in the parameter space between a single stable fixed point and two stable fixed points (saddle-node bifurcation). Consider a fixed point $x^*$ of the system (1). Its components $(x_i^*)$ are i.i.d. gaussian random variables whose moments are given by the previous mean-field equations. To determine the stability of $x^*$ in (1), we compute the Jacobi matrix of the evolution operator

$$D(x^*) = 2g\,L(x^*)\,J \qquad (4)$$

where $L(x^*)$ is the diagonal matrix of components $(x_i^* - x_i^{*2})$ and where $J$ is the connection weight matrix. The spectral radius of this random matrix can be computed and this computation determines a stability boundary surface as shown below in figure 1. At the thermodynamic limit, it is possible to use the “propagation of chaos” property to compute the evolution of the quadratic distance between two trajectories of the system starting from arbitrarily close initial conditions [3]. Zero is a fixed point for this recurrence and the critical condition for its destabilization is exactly the same as the previous condition for the first destabilization of the fixed point of the network. This proves that, at the thermodynamic limit, one observes a sharp transition between a fixed point regime and a chaotic regime. So in the bifurcation map of figure 1 there are four different regions:

- region 1: there is only one stable fixed point;
- region 2: there are two stable fixed points (actually region 2 is a very small cusp);
- region 3: one stable fixed point and one strange attractor coexist. One may converge towards one or the other depending on the initial conditions;
- region 4: there is only one strange attractor.

Regions 2 and 3 shrink when $\sigma_\theta$ increases. When $\sigma_\theta = 0.2$, only regions 1 and 4 remain. For finite size systems, the boundaries for destabilization and for the appearance of chaos are different. There exists an intermediate zone where the system is periodic or quasiperiodic. This corresponds to the quasiperiodicity route to chaos we observe in the system when $gJ$ is increased [5, 4]. The simulations confirm with accuracy all these theoretical predictions for medium-sized networks (from 128 units up to 512 units), showing the progressive stretching of the transition zone where periodic and almost-periodic attractors take place.
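The stability criterion based on the Jacobi matrix can be illustrated on a finite network. The following sketch is a numerical illustration under assumptions, not the analytical computation referred to in the text: it assumes Gaussian weights with $\bar{J} = 0$, the transfer function $f(x) = (1 + \tanh(gx))/2$, and a gain small enough that plain iteration converges to a fixed point. It then evaluates the spectral radius of $D(x^*) = 2g\,L(x^*)\,J$.

```python
import numpy as np

def fixed_point_and_radius(W, theta, g, n_iter=500):
    """Iterate the map to a fixed point, then return (x*, residual, spectral radius of D)."""
    f = lambda u: 0.5 * (1.0 + np.tanh(g * u))
    x = np.full(W.shape[0], 0.5)
    for _ in range(n_iter):
        x = f(W @ x - theta)
    residual = np.max(np.abs(x - f(W @ x - theta)))   # how close x is to a fixed point
    D = 2.0 * g * (x - x**2)[:, None] * W             # D = 2 g L(x*) J, with L diagonal
    rho = np.max(np.abs(np.linalg.eigvals(D)))
    return x, residual, rho

rng = np.random.default_rng(1)
N = 100
W = rng.normal(0.0, 1.0 / np.sqrt(N), size=(N, N))    # zero-mean weights, variance 1/N
theta = rng.normal(0.0, 0.1, size=N)

x_star, residual, rho = fixed_point_and_radius(W, theta, g=0.5)
print(residual, rho)   # rho < 1 indicates a linearly stable fixed point
```

Repeating the computation while increasing `g` gives a finite-size estimate of where the fixed point loses stability, provided the iteration still converges (the residual should be checked each time).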
Figure 1 Saddle-node bifurcation surface and destabilization boundary in the parameter space.


3  Learning  and  Retrieving  Random  Patterns. 

3.1  Network  Submitted  to  a  Static  Random  Input 

We  study  now  the  reaction  of  our  network  when  submitted  to  an  external  input. 
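The input is a vector of random gaussian variables $I_i$, independent of the $J_{ij}$'s, with a mean value $\bar{I}$ and a variance $\sigma_I^2$. We get

$$x_i(t+1) = f\left(\sum_{j=1}^{N} J_{ij}\, x_j(t) - \theta_i + I_i\right). \qquad (5)$$

Hence the input is equivalent to a threshold. The global resulting threshold has a mean value $\bar{\theta} - \bar{I}$ and, the input being independent of the thresholds, a variance $\sigma_\theta^2 + \sigma_I^2$. We can then interpret the presentation of an input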
as  a  change  of  the  parameters  of  the  system.  Its  effect  can  therefore  be  predicted 
from  the  bifurcation  map. 

If  the  configuration  of  the  network  stands  in  the  chaotic  region  near  the  boundary 
of  this  region,  then  the  presentation  of  an  input  will  tend  to  reduce  the  dynamics  of 
the  system  on  a  limit  cycle  by  crossing  the  boundary,  falling  into  the  limit  cycles  and 
T2  torus  area.  The  same  input  may  lead  to  different  answers  by  different  networks 
with  the  same  statistical  parameters.  This  modification  of  the  parameters  by  a 
pattern  presentation  creates  a  new  system  (with  a  different  attractor). 


3.2  Auto-associative  Learning. 

The  learning  procedure  we  define  will  enable  us  to  associate  a  limit  cycle  or  a 
strange  attractor  to  a  presented  pattern.  We  modify  the  connection  strength  by  a 
biologically  motivated  Hebb-like  rule: 

$$J_{ij}(t+1) = \begin{cases} J_{ij}(t) + \dfrac{\alpha}{N}\,[x_j(t) - 0.5]\,[x_i(t+1) - 0.5] & \text{if } x_j(t) > 0.5, \\ J_{ij}(t) & \text{otherwise.} \end{cases}$$

We add the constraint that a weight cannot modify its sign. $\alpha$ is the learning rate.
The  learning  procedure  is  implemented  at  each  step  when  the  network  has  reached 
the  stationary  regime. 
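One learning step of such a Hebb-like rule might be sketched as follows. Several points are assumptions made for illustration and are not confirmed by the text: the increment is scaled by $\alpha/N$, and the constraint that a weight cannot change its sign is enforced by setting to zero any weight that would otherwise cross zero. The function name is hypothetical.

```python
import numpy as np

def hebb_step(W, x_prev, x_next, alpha):
    """Update W[i, j] from presynaptic x_j(t) = x_prev[j] and postsynaptic x_i(t+1) = x_next[i]."""
    N = W.shape[0]
    # dW[i, j] = (alpha / N) * (x_i(t+1) - 0.5) * (x_j(t) - 0.5)   (scaling is an assumption)
    dW = (alpha / N) * np.outer(x_next - 0.5, x_prev - 0.5)
    dW[:, x_prev <= 0.5] = 0.0            # only presynaptic units with x_j(t) > 0.5 learn
    W_new = W + dW
    # Sign constraint (assumed implementation): clip weights that would change sign to zero.
    W_new[W_new * W < 0.0] = 0.0
    return W_new

rng = np.random.default_rng(0)
N = 50
W = rng.normal(0.0, 1.0 / np.sqrt(N), size=(N, N))
x_prev = rng.uniform(0.0, 1.0, size=N)
x_next = 0.5 * (1.0 + np.tanh(4.0 * (W @ x_prev)))   # one step of the dynamics, no threshold
W_new = hebb_step(W, x_prev, x_next, alpha=0.1)
print(np.abs(W_new - W).max())
```

Other ways of enforcing the sign constraint, such as skipping the offending update entirely, are equally compatible with the description and would change only the clipping line.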
In all our simulations the learning procedure reduces the fractal dimension of the
chaotic  attractors  of  the  dynamics.  Eventually  the  system  follows  an  inverse  quasi- 
periodicity  route  towards  a  fixed  point.  We  have  chosen  to  stop  the  procedure  on  a 
limit  cycle,  thus  associating  this  cycle  with  the  pattern  learned. 

In fact, one can speak of learning only if the network has a selective response for the learned pattern, and if the learning procedure does not affect the dynamical behaviour when another pattern is presented. In order to study this selectivity property, we make the following simulation. We learn one prototype pattern (i.e. we iterate the learning dynamics until a limit cycle is reached). Then we present 30 other random patterns (drawn with the same statistics as the learned pattern), and we compare the attractors before and after learning. For all patterns but one, the dynamics before learning were chaotic. For all these patterns, the dynamics were still chaotic after learning (the pattern whose response was periodic still had a periodic attractor). Hence the network reduces its dynamics only for the learned prototype pattern.

3.3  Retrieval  Property 

In  order  to  study  the  retrieval  property  of  this  associative  learning  process,  we  add 
noise  to  a  pattern  previously  learned,  and  show  how  it  affects  the  recognition.  A 
pattern  is  a  vector  of  gaussian  random  variables  (for  the  simulations,  each  com¬ 
ponent  has  a  mean  zero  and  standard  deviation  0.1).  We  add  to  each  component 
a gaussian noise of mean zero and standard deviation 0.01 or 0.02, which corresponds to a noise level of 10% or 20% on the pattern. We recall that the presentation of a noisy pattern changes the parameters of the system, so recognition has to be defined by some similarity between the attractors. In order to quantify the similarity between cycles, we compute their gravity center, their mean radius and their winding (rotation) number, and we define a criterion of similarity based on these numerical indexes.

To  estimate  the  recognition  rate,  we  learn  one  prototype  pattern.  Then  we  present 
30  noisy  patterns  (derived  from  the  prototype  one),  and  compute  the  similarity 
values. With a 10% level of noise, the recognition rate after 7 learning steps is 83%; with 20% noise it falls to 27%.

4  Discussion  and  Conclusion 

Our  model  reproduces  the  observations  by  Freeman  concerning  the  dimension  re¬ 
duction  of  the  system  attractor  by  recognition  of  a  learned  pattern,  in  a  model 
of  the  olfactory  bulb  [11].  The  system  does  not  need  to  wait  until  a  fixed  point 
is  reached  to  perform  recognition:  relying  on  our  experiments,  convergence  on  an 
attractor  is  very  fast.  This  gives  an  insight  into  the  mechanism  leading  to  such  a 
phenomenon  by  the  extraction  of  the  few  relevant  parameters  related  to  it. 
However  this  model  suffers  severe  drawbacks  concerning  the  control  of  the  learning 
process  and  the  storage  capacity.  Moreover,  the  modifications  supported  by  the 
weights during the learning process are difficult to interpret theoretically. These limitations suggest introducing, as a next step, architectural and functional differentiation into the network (inhibitory and excitatory neurons, multiple layers, random geometry-dependent connectivity and time delays, modular architecture of chaotic oscillators).



Samuelides  et  al:  Spontaneous  Dynamics  and  Learning 

Coding  by  dynamical  attractors  is  also  particularly  suited  for  learning  of  temporal 
signals. For the moment, we have focused only on the learning of static patterns. However, we have performed some preliminary simulations on the presentation of temporal sequences, with interesting results. This could lead to connecting different chaotic networks in order to perform recognition tasks using the synchronization processes highlighted by Gray [7].
Statistical mechanics can be used to derive a set of equations describing the evolution of a genetic algorithm involving crossover, mutation and selection. This paper gives an introduction to this work. It is shown how the method can be applied to very simple problems, for which the dynamics of the genetic algorithm can be reduced to a set of nonlinear coupled difference equations. Good results are obtained when the equations are truncated to four variables.

1 Introduction

Genetic  Algorithms  (GA)  are  a  class  of  search  techniques  which  can  be  used  to  find 
solutions  to  hard  problems.  They  have  been  applied  in  a  range  of  domains:  opti¬ 
misation  problems,  machine  learning,  training  neural  networks  or  evolving  neural 
network  architectures,  and  many  others.  (For  an  introduction,  see  Goldberg  [1].) 
They  can  be  naturally  applied  to  discrete  problems  for  which  other  techniques  are 
more  difficult  to  use,  and  they  parallelise  well.  Most  importantly,  they  have  been 
found  to  work  in  many  applications. 

Although  GAs  have  been  widely  studied  empirically,  they  are  not  well  understood 
theoretically.  Unlike  gradient  descent  search  and  simulated  annealing,  genetic  al¬ 
gorithms  are  not  based  on  a  well-understood  process.  The  goal  of  the  research  de¬ 
scribed  here  is  to  develop  a  formalism  which  allows  the  study  of  genetic  algorithm 
dynamics  for  problems  of  realistic  size  and  finite  population  sizes.  The  problems 
we  have  studied  are  clearly  toy  problems;  however,  they  contain  some  realistic  as¬ 
pects,  and  we  hope  their  consideration  will  be  a  stepping  stone  to  the  study  of  more 
realistic  problems. 

2  The  Statistical  Mechanics  Approach 

Ideally,  one  would  like  to  solve  the  dynamics  of  genetic  algorithms  exactly.  This 
could  be  done,  in  principle,  either  by  studying  the  stochastic  equation  directly,  or 
by  using  a  Markov  Chain  formulation  to  analyse  the  deterministic  equation  for  the 
probability  of  being  in  a  given  population  (see  [2,  3]).  However,  it  is  very  difficult 
to  make  progress  in  this  way,  because  one  must  solve  a  high-dimensional,  strongly 
interacting,  nonlinear  system  which  is  extremely  difficult  to  analyse. 

In these exact approaches, the precise details of which individuals are in the population are considered. In our approach much less information is assumed to be known.
We  consider  only  statistical  properties  of  the  distribution  of  fitnesses  in  the  pop¬ 
ulation;  from  the  distribution  of  fitnesses  at  one  time,  we  predict  the  distribution 
of  fitnesses  at  the  next  time  step.  Iterating  this  from  the  initial  distribution,  we 
propose  to  predict  the  fitness  distribution  for  all  times. 

This distribution tells you what you want to know; for example, the evolution of the best member of the population can be inferred. But is it possible, in principle, to predict the fitness at later times based on the fitness at earlier times? The answer is no. Although it is possible to predict the effect of selection based solely
on  the  fitness  distribution,  mutation  and  crossover  depend  upon  the  configurations 
of  the  strings  in  the  population  which  cannot  be  inferred  from  their  fitnesses.  We 
use  a  statistical  mechanics  assumption  to  bridge  this  gap. 

We  characterise  the  distribution  by  its  cumulants.  Cumulants  are  statistical  prop¬ 
erties  of  a  distribution  which  are  like  moments,  but  are  more  stable  and  robust.  The 
first  two  cumulants  are  the  mean  and  variance  respectively.  The  third  cumulant  is 
related to the skewness; it measures the asymmetry of the distribution about the mean. The fourth cumulant is related to the kurtosis; it measures whether the distribution
falls  off  faster  or  more  slowly  than  Gaussian. 
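For concreteness, the first four cumulants of a finite population of fitness values can be computed from its central moments. This is a minimal sketch using simple population (biased) estimators, which is an assumption made here for illustration; the paper's formalism additionally accounts for finite-population corrections that are not reproduced in this example.

```python
import numpy as np

def first_four_cumulants(F):
    """Return (k1, k2, k3, k4) of the sample F using population (biased) central moments."""
    F = np.asarray(F, dtype=float)
    k1 = F.mean()
    d = F - k1
    m2, m3, m4 = (d**2).mean(), (d**3).mean(), (d**4).mean()
    # k2 = variance, k3 = third central moment, k4 = m4 - 3 m2^2 (zero for a Gaussian).
    return k1, m2, m3, m4 - 3.0 * m2**2

# A symmetric two-valued population: zero mean, unit variance, no skew, negative k4.
k1, k2, k3, k4 = first_four_cumulants([-1.0, 1.0, -1.0, 1.0])
print(k1, k2, k3, k4)
```

A negative fourth cumulant, as in this example, corresponds to tails that fall off faster than Gaussian.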

3  Test  Problems 

The  method  has  been  applied  to  four  problems.  The  simplest  task,  and  one  which 
will  be  used  throughout  this  paper  to  illustrate  the  method,  is  the  optimisation 
of  a  linear,  spatially  inhomogeneous  function,  counting  random  numbers.  In  this 
problem,  each  bit  of  the  string  contributes  individually  an  arbitrary  amount.  The 
fitness is

$$F(\tau) = \sum_{i=1}^{N} J_i\, \tau_i .$$

Here $\tau$ is a string of ±1 of length $N$ which the GA searches over. The $J_i$'s are fixed random numbers.

Other  problems  to  which  the  method  has  been  applied  include:  the  task  of  max¬ 
imising  the  energy  of  a  spin-glass  chain  [4,  5,  6],  the  subset  sum  problem  [7]  (an 
NP  hard  problem  in  the  weak  sense),  and  learning  in  an  Ising  perceptron  [8].  For 
the  first  two  problems,  the  dynamics  can  be  solved,  and  the  formalism  predicts  ac¬ 
curately  the  evolution  of  the  genetic  algorithm.  The  latter  problem  has  been  much 
more  difficult  to  solve,  however. 

4  The  Genetic  Operators 

We  now  discuss  how  our  approach  is  applied  to  the  study  of  three  specific  GA 
operators:  selection,  mutation,  and  crossover.  We  will  use  the  counting  random 
numbers  task  to  illustrate  how  the  calculations  are  done. 

4.1  Selection 

Selection is the operation whereby better members of the population are replicated and less good members are removed. The most frequently used method is to choose the members of the next population probabilistically using a weighting function $R(F^\alpha)$. We have studied Boltzmann selection

$$R(F) \propto \exp(\beta F)$$

where $\beta$ is a parameter controlling the selection rate. Fitness proportional selection ($R(F) = F$), culling selection (choose the n best), and other forms can also be handled by the approach.
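Boltzmann selection can be sketched as sampling with replacement, with weights proportional to exp(βF). The function name `boltzmann_select` is hypothetical, and keeping the population size fixed is an assumption of this sketch.

```python
import math
import random


def boltzmann_select(fitnesses, beta, rng):
    """Return the indices of the members chosen for the next population.

    Each of the P slots is filled independently, member alpha being chosen
    with probability proportional to exp(beta * F_alpha).
    """
    f_max = max(fitnesses)
    # Subtracting the maximum avoids overflow and leaves the ratios unchanged.
    weights = [math.exp(beta * (f - f_max)) for f in fitnesses]
    return rng.choices(range(len(fitnesses)), weights=weights, k=len(fitnesses))
```

With β = 0 every member is equally likely to be copied, while for large β the fittest member takes over the whole population, which is the loss of diversity discussed in the text.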

In an infinite population, the new distribution of fitness in terms of the old p(F) would simply be R(F)p(F) up to a normalisation factor. In a finite population, there will be fluctuations around this. The way to treat this is to draw P levels from p. The generating function for cumulants is

$$G(\gamma) = \int \prod_{\alpha=1}^{P} p(F^\alpha)\, dF^\alpha \; \log\left[ \sum_{\alpha=1}^{P} R(F^\alpha) \exp(-\gamma F^\alpha) \right].$$


This  problem  is  analogous  to  the  Random  Energy  Model  proposed  and  studied  by 
Derrida  [9],  and  can  be  computed  by  using  the  same  methods  developed  for  that 
problem. 

For Boltzmann selection, the cumulant expansion for the distribution after selection in terms of the distribution before selection can be found as an expansion in the selection parameter $\beta$. Alternatively, the cumulants can be found numerically for arbitrary $\beta$.

For many problems, the initial distribution will be nearly Gaussian. Selection introduces a negative third cumulant: the low-fitness tail is more occupied than the high-fitness one. An optimal amount of selection is indicated. The improvement of the mean increases with $\beta$ initially, but saturates for large $\beta$. Selection also decreases the variance; this effect is small for small $\beta$ but becomes important for increasing values. The trade-off between these two effects of selection, increase in mean but decrease in genetic diversity, is balanced for intermediate values of $\beta$. This has been discussed in earlier work [5, 6].


4.2  Mutation 

Mutation  causes  small  random  changes  to  the  individual  bits  of  the  string.  Thus,  it 
causes  a  local  exploration,  in  common  with  many  other  search  algorithms.  It  acts 
on  each  member  of  the  population  independently. 

To study the effect of mutation, we introduce a set of mutation variables, $m_i^\alpha$, one for each site of each string, which take the value 1 if the site is mutated, 0 otherwise. Let m be the mutation probability. In terms of these variables, the fitness after mutation is (for counting random numbers)

$$F^\alpha = \sum_i J_i (1 - 2 m_i^\alpha) \tau_i^\alpha. \qquad (2)$$

Averaging over all variables gives the cumulants after mutation. This yields,

$$\kappa_1^m = (1 - 2m)\,\kappa_1$$
$$\kappa_2^m = \kappa_2 + \left(1 - (1 - 2m)^2\right)(N - \kappa_2) \qquad (3)$$
$$\kappa_3^m = \kappa_3 (1 - 2m)^3 - 2(1 - 2m)\left(1 - (1 - 2m)^2\right) \sum_i J_i^3 \langle \tau_i \rangle$$

where $\langle \tau_i \rangle$ denotes the population average. (The fourth cumulant can be calculated in a similar fashion.)

The first two equations express obvious effects. Mutation brings the mean and
variance  back  towards  values  of  a  random  population  —  mean  decreases  to  zero 
and  the  variance  increases  toward  N.  The  equation  for  the  third  cumulant  is  more 
interesting.  The  first  term  decreases  the  third  cumulant,  but  the  second  increases 
its  magnitude  if  it  is  negative  (which  it  typically  is).  This  is  the  part  which  depends 
upon  the  configuration  average. 
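The three update equations in (3) can be written as a small function, as in the following sketch. The function name `mutate_cumulants` is hypothetical, and the configuration-dependent sum over $J_i^3 \langle \tau_i \rangle$ is passed in as a single number `j3_tau`, since it cannot be obtained from the cumulants alone.

```python
def mutate_cumulants(k1, k2, k3, m, n, j3_tau):
    """Apply the mutation equations (3) to the first three cumulants.

    m      : mutation probability per site
    n      : string length N
    j3_tau : the sum over sites of J_i**3 * <tau_i> (configuration average)
    """
    a = 1.0 - 2.0 * m
    k1_new = a * k1
    k2_new = k2 + (1.0 - a ** 2) * (n - k2)
    k3_new = a ** 3 * k3 - 2.0 * a * (1.0 - a ** 2) * j3_tau
    return k1_new, k2_new, k3_new
```

Two limits are easy to check: m = 0 leaves the cumulants unchanged, and m = 1/2 randomises every site, giving zero mean, variance N and zero third cumulant.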

4.3  Crossover 

Crossover can be treated in a similar manner to mutation. Introduce a set of crossover variables, which are 1 if string $\alpha$ is crossed with string $\beta$ at site i and 0 otherwise, write equations for the cumulants in terms of these variables, and average. One finds that both the mean and the variance remain unchanged. The third and higher cumulants are reduced and brought to natural values which depend upon configurational averages.
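The site-wise crossover variables can be illustrated with the sketch below. Uniform crossover with swap probability 1/2, and keeping both children, are assumptions of this sketch rather than details given in the text; the names `crossover` and `fitness` are hypothetical.

```python
import random


def fitness(couplings, string):
    """Counting-random-numbers fitness: the sum over sites of J_i * tau_i."""
    return sum(j * t for j, t in zip(couplings, string))


def crossover(parent_a, parent_b, rng, p_swap=0.5):
    """Swap the two parents' values at each site with probability p_swap."""
    child_a, child_b = [], []
    for a, b in zip(parent_a, parent_b):
        if rng.random() < p_swap:      # crossover variable equal to 1 at this site
            child_a.append(b)
            child_b.append(a)
        else:                          # crossover variable equal to 0
            child_a.append(a)
            child_b.append(b)
    return child_a, child_b
```

For an additive fitness such as counting random numbers, the two children together carry exactly the same site contributions as the two parents, so the summed fitness of the pair is conserved. This is one way to see why the mean is unaffected by crossover in this problem.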

4.4  Using  the  Statistical  Mechanics  Ansatz  to  Infer 
Configurational  Averages  from  Fitness  Distribution 
Averages 

In  the  previous  section  we  showed  that  the  effect  of  mutation  on  the  cumulants 
depends  upon  the  fitness  distribution  (i.e.  on  the  cumulants)  and  upon  properties 
of  the  configuration  of  the  strings.  As  we  have  argued,  one  cannot,  in  principle, 
determine  the  configuration  from  the  fitness  distribution.  We  invoke  a  statistical 
mechanics  ansatz.  We  assume  that  the  string  variables  are  free  to  fluctuate  subject 
to the constraint that the cumulants are given. In other words, our assumption
is  that  of  all  the  configurations  which  have  the  same  distribution  of  fitnesses,  the 
more  numerous  ones  are  more  likely  to  describe  the  actual  one.  This  can  be  seen 
as  a  maximum  entropy  assumption.  We  use  a  simple  statistical  mechanics  model 
to  implement  the  proposed  relationship  between  fitness  statistics  and  configuration 
statistics.  Details  are  presented  in  [6]. 

5  Discussion  and  Future  Work 

The  equations  for  each  of  the  genetic  operators  can  be  put  together  to  predict  the 
whole  dynamics.  Typical  curves  are  shown  in  figure  1.  The  theoretical  curves  were 
produced  by  calculation  of  the  initial  distribution  theoretically,  and  iterating  the 
equations repeatedly. No experimental input was used. Although the agreement between experiment and theory is not perfect, these results are as accurate as any approach of which we are aware.

This gives a picture of the role of crossover, for this simple problem. Selection improves the average fitness, but decreases variance and introduces a negative skewness. Mutation increases the variance, introducing genetic diversity; however, it also
decreases  the  mean.  Crossover  has  no  effect  on  the  first  two  cumulants.  It  reduces 
the  magnitude  of  the  skewness,  however.  This  replaces  some  of  the  low-fitness  tail 
with  some  high-fitness  tail,  which  improves  the  best  member  of  the  population. 
Since,  for  this  problem,  crossover  has  no  effect  on  the  mean,  there  is  no  cost  in 
doing  this  and  crossover  is  helpful.  For  a  realistic  problem,  however,  crossover  will 
introduce  a  deleterious  effect  to  the  mean;  whether  this  effect  dominates  over  the 
improvement  due  to  the  decrease  of  the  third  cumulant  may  determine  whether  the 
crossover operator is a useful one for the problem in question.

Figure 1 The solid curves show the evolution of the first two cumulants and the fitness of the best member of the population for N = 127, P = 50, $\beta$ = 0.1 and m = 1/2N. The ground state is at $F_{best} \approx 101$. The dashed curve shows the theoretical prediction. The overscore indicates averaging over 10000 runs. From reference [6].

VOLUMES OF ATTRACTION BASINS IN RANDOMLY CONNECTED BOOLEAN NETWORKS

P.N. Lebedev Physics Institute, Leninsky pr.53, Moscow, Russia.

The paper presents the distribution function of the volumes of attraction basins in phase portraits of Boolean networks with random interconnections for a special class of uniform nets, built from unbiased elements. The distribution density at large volumes tends to the universal power law $P \propto V^{-3/2}$.

1  Introduction 

Randomly Connected Boolean Networks (RCBN) represent a wide class of connectionist models, which attempt to understand the behavior of systems built from a large number of richly interconnected elements [1].

Since the early studies of Kauffman [2], RCBN have become a classical model for studying the dynamical properties of random networks. The most intriguing feature of RCBN is the phase transition from the stochastic regime to the ordered behavior, first observed in simulations [2], and later explained analytically [3]-[5].

In the chaotic phase the phase trajectories are attracted by very long cycles, whose lengths grow exponentially with the size of the system N [4]. Thus, even for not so large systems it is practically impossible to identify these cycles, and their phase trajectories resemble random walks in the phase space.

On the contrary, in the ordered phase the short cycles dominate [4]. In fact, this paper will present numerical evidence that almost the whole of the system's phase space belongs to the attraction basins of the fixed points. The distribution of the volumes of these attraction basins is an important characteristic, since it gives the probabilities of different kinds of asymptotic behavior of the system.

So far, the distribution of attraction basins is known only for fully connected Boolean networks, namely the Random Map Model (RMM) [6], which represents the chaotic RCBN. The onset of the ordered phase implies that the mean number of inputs per element, K, does not exceed some critical value, which depends only on the type of the Boolean elements [5]. Thus, this phase takes place only in the case of extremely sparse interconnections: $K/N \to 0$ for $N \to \infty$.

In the present paper we calculate the distribution of attraction basins in the ordered phase, using the fact that in sparsely connected networks there exists a correlation between the numbers of logical switchings at consecutive time steps, which is absent in the RMM. In Sec. 2 we formulate our model as a straightforward extension of the RMM, which takes into account the above correlation. Sec. 3 presents the solution for our problem found by means of the theory of branching processes. Sec. 4 presents the comparison of the theoretical predictions with the simulation results for Ising neural networks. Sec. 5 gives some concluding remarks.

2  Model  Description 

The networks under consideration consist of N two-state automata receiving their inputs from the outputs (states) of other automata connected with the given one. These connections are attached at random when the network is assembled and then fixed. The parallel dynamics is governed by the map:

$$\mathbf{x} \Rightarrow \phi(\mathbf{x}), \qquad (\mathbf{x}, \phi \in \{\pm 1\}^N). \qquad (1)$$

Function $\phi(\mathbf{x})$ depends vacuously on some of its arguments.

In a phase portrait of a network there is a unique phase vector beginning at each phase state. However, there are no restrictions on the number of vectors coming to
that  state.  Thus,  any  phase  portrait  represents  a  forest  of  the  trees  with  their  roots 
belonging  to  the  attractor  set.  For  the  ordered  phase  this  attractor  set  is  represented 
mainly  by  fixed  points. 

We shall be interested in the statistical properties of this forest, namely in the distribution of volumes of the trees. These statistical properties, of course, should characterize not an individual map, but some ensemble of maps. The ensemble of RCBN contains all possible variants of interconnections among N automata, chosen at random from some infinite basic set of automata. We assume that this set of automata is unbiased, that is, the statistical properties of phase vectors do not depend on the state point. Such ensembles will be further referred to as uniform ones.

Such is, for example, the RMM, where a phase vector starting in any state may come to each state with equal probability. Since there are $\Omega$ such possibilities for each state point (where $\Omega = 2^N$ is the phase volume of a system), this ensemble consists of maps corresponding to all possible variants of fully connected Boolean networks.

In the absence of correlations between consecutive vectors, one can characterize the uniform ensemble by only one statistical characteristic, the distribution of vector lengths (in the Hamming sense), $W_m$, which is binomial:

$$W_m = \binom{N}{m} w_0^{N-m} (1 - w_0)^m, \qquad (2)$$

with $w_0 = (1 + \nu_0)/2$ being the mean fraction of fixed points in the phase portraits of automata from the basic set [5]. For the RMM, for example, $\nu_0 = 0$ and $w_0 = 1/2$. But in the absence of self-excitable automata in the basic automata set (i.e. those that oscillate for fixed values of their inputs), $\nu_0 > 0$. This leads to an exponentially large number of fixed points:

$$\Omega_0 = \Omega W_0 = (1 + \nu_0)^N. \qquad (3)$$
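Equations (2) and (3) can be checked numerically with a short sketch. The function names `length_distribution` and `expected_fixed_points` are hypothetical.

```python
from math import comb


def length_distribution(n, nu0):
    """Binomial distribution W_m of Hamming lengths of phase vectors, eq. (2).

    w0 = (1 + nu0) / 2 is the probability that a single automaton keeps its state.
    """
    w0 = (1.0 + nu0) / 2.0
    return [comb(n, m) * (1.0 - w0) ** m * w0 ** (n - m) for m in range(n + 1)]


def expected_fixed_points(n, nu0):
    """Mean number of fixed points, Omega * W_0 = (1 + nu0)**n, eq. (3)."""
    return 2 ** n * length_distribution(n, nu0)[0]
```

For the RMM ($\nu_0 = 0$) this gives one fixed point on average, whatever the system size, while any $\nu_0 > 0$ gives a number that grows exponentially with N.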

In the general case there exists a correlation between the lengths of consecutive vectors. This correlation is more pronounced in diluted networks. Indeed, recall that the vector's length is the number of automata which change state at the corresponding time step. In diluted networks the probability that a given automaton will change state is proportional to the probability that at least one of its inputs has changed its value. As a consequence, the mean number of automata switchings at the next step is proportional to that at the previous step.

To take this fact into account we introduce the conditional probability $P_{ml}$ that in phase trajectories from a given ensemble a vector with length m will be followed by a vector of length l. This is a straightforward generalization of the RMM, for which $P_{ml} = W_l$.

Summarizing, we will deal with an ensemble of phase portraits (1) with the statistical characteristics of phase trajectories given by $W_m$ and $P_{ml}$. We are interested, however, not in the characteristics of trajectories, but rather in that of random trees in phase portraits from our ensemble. Such is the mean number $\Pi_{lm}$ of vectors with the length m, which precede the vector with the length l:

$$\Pi_{lm} = W_m P_{ml} / W_l, \qquad l = 0, \dots, N. \qquad (4)$$

Zero-length vectors do not precede any state, and are thus excluded: $\Pi_{l0} = 0$. The normalization condition $\sum_l P_{ml} = 1$ may be rewritten as:

$$\sum_{l=0}^{N} W_l \Pi_{lm} = W_m, \qquad (m = 1, \dots, N). \qquad (5)$$
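The relation between the trajectory statistics and the tree statistics in (4) and (5) can be sketched as follows. The function name `preceding_means` is hypothetical, and the example uses the RMM case $P_{ml} = W_l$ only because it is the one case given explicitly in the text.

```python
from math import comb


def preceding_means(w, p):
    """Return the matrix Pi[l][m] = W_m * P_ml / W_l of eq. (4).

    w[m]    : probability of a phase vector of length m
    p[m][l] : probability that a vector of length m is followed by one of length l
    Zero-length vectors are excluded, so Pi[l][0] = 0 for every l.
    """
    n = len(w) - 1
    pi = [[0.0] * (n + 1) for _ in range(n + 1)]
    for l in range(n + 1):
        for m in range(1, n + 1):
            pi[l][m] = w[m] * p[m][l] / w[l]
    return pi


# RMM example for N = 4: unbiased binomial lengths and no correlations.
N = 4
W = [comb(N, m) / 2 ** N for m in range(N + 1)]
P = [W[:] for _ in range(N + 1)]
PI = preceding_means(W, P)
```

In the RMM case every row of the matrix is the same, $\Pi_{lm} = W_m$, so the number of predecessors does not depend on the length of the vector they lead to.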

To complete the description of the random forest one needs not only the average number $\Pi_{lm}$, but the whole distribution function for the number of different vectors preceding the given vector of length l. In the limit $\Omega \to \infty$ this distribution tends to a Poisson one with the generating function:

$$\mathbf{f}(\mathbf{s}) = \exp[\mathbf{\Pi}(\mathbf{s} - \mathbf{1})]. \qquad (6)$$

The latter is a function of the formal parameter $\mathbf{s}$:

$$f_l(\mathbf{s}) = \sum_{m_1, \dots, m_N} \pi^{l}_{m_1 \dots m_N}\, s_1^{m_1} \cdots s_N^{m_N},$$

with $\pi^{l}_{m_1 \dots m_N}$ being the probability that a vector of length l is preceded by $m_1$ vectors of length 1, ..., $m_N$ vectors of length N. In the rest of this paper all running indexes range from 1 to N.

3  Distribution  of  the  Basins  Volumes 

Now,  when  all  the  characteristics  of  random  trees  are  determined,  one  can  use  the 
well known results of the theory of branching processes. For the sake of clarity
we  will  not  supply  the  details  here,  relegating  the  involved  calculations  to  a  more 
extensive  publication. 


3.1  General  Solution 

The theory of branching processes allows one to find the distribution of the volumes of random trees, provided the generating function f(s) is known [7]. For the one given by (6) the fraction of trees of volume V is:

$$\mathcal{P}(V) = \frac{\exp\left(-\sum_i k_i \delta_i / 2\right)}{\sqrt{2\pi \sum_i k_i p_i}}, \qquad (7)$$

where k, p and $\delta$ satisfy the saddle point equations (8) and (9). [Equations (8) and (9) are illegible in the source.]

Here the matrix $R_{ij} = \delta_{ij} - \Pi_{ij}$ has the eigenvalues $\gamma_i = 1 - \kappa_i$, being the relaxation decrements for the initial Markov process ($\kappa_i$ are eigenvalues of matrices P and $\Pi$).

The  solution  of  this  system  of  equations  reads: 

P<  =  E 


/?’ 


A„-/5’ 


(10)

with $\lambda_a$, $l_i^a$ and $r_i^a$ being the eigenvalues, the left and the right eigenvectors of the generalized eigenvalue problem:


IflTij  =  KPjij,  (li) 

i  3 

Coefficients  and  in  (10)  represent  the  spectral  expansion  of  the  known  vectors 
ki  =  lfk°/Xa,Pi  =  rfp^/Xa.  The  Lagrange  parameter  /?  is  found  from  the 


equation: 


(12) 


Multiplying the first equation of (9) by $\delta_j$ and summing over j gives $\sum_j k_j \delta_j = \beta (V - \bar{V})$, and one finally obtains:

$$\mathcal{P}(V) = \frac{\exp[-\beta (V - \bar{V})/2]}{\sqrt{2\pi \sum_i k_i p_i}}. \qquad (13)$$


3.2  Distribution  of  Basins  Volumes 

Distribution  (13)  is  valid  for  a  general  branching  process  with  different  types  of 
branches.  Now  we  will  make  use  of  the  specific  feature  of  our  branching  process  in 
random phase portraits. Namely, in this particular case the first eigenvalue $\lambda_1$ of
the  above  generalized  eigenvalue  problem  is  much  less  than  the  others,  and  is  so 
small,  that  the  exponential  factor  in  (13)  is  negligible:  Xa  ~  O  (iV°(l  —  z^o)^/^) 
The exponential cut-off of the distribution (13) occurs at $\beta \to \lambda_1$, that is at $V \sim 1/\lambda_1$. Since $1/\lambda_1 > \Omega$, the exponential factor is negligible, and distribution (13) is represented by a power law.

For $\beta \ll \lambda_1$ one can leave $\beta$ only in the term with m = 1 in (10), obtaining:

$$\mathcal{P}(V) \propto [P_3(V)]^{-1/2}, \qquad (14)$$

with $P_3(V)$ being a polynomial of the third degree. For $V \gg \bar{V}$ the asymptotic scaling is $\mathcal{P}(V) \propto V^{-3/2}$, the same as for the RMM.
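The origin of the exponent 3/2 can be illustrated in the simplest setting, which is a simplification and not the multi-type process treated above: a single-type branching process with Poisson offspring. Its total tree size follows the Borel distribution, a standard result, and at criticality (mean offspring equal to 1) the tail decays as $V^{-3/2}$. The function name `tree_size_pmf` is hypothetical.

```python
import math


def tree_size_pmf(v, mean_offspring=1.0):
    """Probability that a Poisson branching tree has exactly v nodes.

    Borel distribution: exp(-mu * v) * (mu * v)**(v - 1) / v!,
    evaluated in log space to avoid overflow for large v.
    """
    mu = mean_offspring
    log_p = -mu * v + (v - 1) * math.log(mu * v) - math.lgamma(v + 1)
    return math.exp(log_p)


def power_law_tail(v):
    """Asymptotic form of the critical case: v**(-3/2) / sqrt(2 * pi)."""
    return v ** -1.5 / math.sqrt(2.0 * math.pi)
```

Below criticality the same formula acquires an exponential cut-off, which plays the role of the factor $\exp[-\beta(V - \bar{V})/2]$ in (13).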

4  Simulations 

The above theoretical predictions were checked in computer simulations of neural networks with random diluted asymmetric Ising interconnections. The networks consisted of N two-state elements $x_i \in \{\pm 1\}$, each connected with K randomly chosen other elements. The states of the elements were updated in parallel according to the rule:

$$x_i \Rightarrow \mathrm{sign}(h_i), \qquad h_i = \sum_j J_{ij} C_{ij} x_j, \qquad (15)$$

where C is the random interconnection matrix with K units and N − K zeros in each row ($C_{ii} = 0$), and matrix J has the Ising components $J_{ij} = \pm 1$ with equal probabilities. For a given network these matrices are fixed during the simulations. If $h_i$ appears to be zero, the state of the i-th neuron is unchanged. The latter case may occur only for an even number of interconnections K. The ordered phase for such a model takes place only for $K \le 2$. Since the case K = 1 is degenerate, we considered networks with K = 2.

The first thing we examined was the fraction of phase space in the basins of fixed points. This quantity was estimated by tracing 10^ random trajectories till they reached their attractors. The results showed that for sufficiently large N > 2b all phase trajectories converge to some fixed point.

Figure 1 Distribution density of basins volumes in the ensemble of 25000 networks with N = 11 and K = 2.

To obtain the distribution of the basins we examined the whole phase space. Since the computational time grows exponentially with the size of the network, we chose networks of relatively small size, N = 11 (the measured probability of convergence was 0.97). For K = 2, 25000 random networks were generated, and each phase portrait was examined point by point. Fig. 1 shows the obtained distribution of the volumes of the basins of fixed points. The dotted line on this plot represents the asymptotic power law $\mathcal{P} \propto V^{-3/2}$. The correspondence is good in a wide range of volumes, except the very tail of the distribution. The observed cut-off of the power law is due to the finiteness of the state space, ignored by the branching process approach.
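The exhaustive examination of a phase portrait can be sketched as below. This is an illustrative reconstruction under stated assumptions, not the authors' code: states are stored as bit masks (bit i set means $x_i = +1$), and the names `make_network`, `step` and `basin_volumes` are hypothetical.

```python
import random
from collections import Counter


def make_network(n, k, rng):
    """Give each element k distinct inputs (never itself) and couplings +/-1."""
    inputs = [rng.sample([j for j in range(n) if j != i], k) for i in range(n)]
    couplings = [[rng.choice((-1, 1)) for _ in range(k)] for _ in range(n)]
    return inputs, couplings


def step(state, inputs, couplings):
    """One parallel update x_i => sign(h_i); a zero field leaves x_i unchanged."""
    new_state = 0
    for i, (ins, js) in enumerate(zip(inputs, couplings)):
        h = sum(w * (1 if (state >> j) & 1 else -1) for j, w in zip(ins, js))
        bit = 1 if h > 0 else 0 if h < 0 else (state >> i) & 1
        new_state |= bit << i
    return new_state


def basin_volumes(n, inputs, couplings):
    """Label every state by its attractor; return basin sizes and cycle lengths."""
    size = 1 << n
    succ = [step(s, inputs, couplings) for s in range(size)]
    label = [-1] * size
    cycle_len = {}
    for start in range(size):
        path, seen, s = [], set(), start
        while label[s] == -1 and s not in seen:
            seen.add(s)
            path.append(s)
            s = succ[s]
        if label[s] != -1:
            attractor = label[s]               # joined an already known basin
        else:
            cycle = path[path.index(s):]       # closed a new cycle
            attractor = min(cycle)
            cycle_len[attractor] = len(cycle)
        for t in path:
            label[t] = attractor
    return Counter(label), cycle_len
```

Basins whose cycle length is 1 belong to fixed points. Collecting their volumes over many random networks gives a histogram of the kind shown in the figure.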

5  Conclusions 

The paper presented the distribution of the attraction basins of randomly connected Boolean networks, which generalize the Kauffman NK-model. The obtained results may be useful, for example, for studying the relaxation processes in nonergodic systems. At finite temperature the volumes of basins play a similar role as the energy levels of the valleys in spin glasses.

EVIDENTIAL REJECTION STRATEGY FOR NEURAL NETWORK CLASSIFIERS
A.  Shustorovich 

Business  Imaging  Systems,  Eastman  Kodak  Company,  Rochester,  USA. 

In  this  paper  we  present  an  approach  to  estimation  of  the  confidences  of  competing  classification 
decisions  based  on  the  Dempster-Shafer  Theory  of  Evidence.  It  allows  us  to  utilize  information 
contained  in  the  activations  of  all  output  nodes  of  a  neural  network,  as  opposed  to  just  one  or 
two  highest,  as  it  is  usually  done  in  current  literature.  On  a  test  set  of  8,000  ambiguous  patterns, 
a rejection strategy based on this confidence measure achieves up to 30% reduction in error rates
as  compared  to  traditional  schemes. 

1  Introduction 

Neural  network  classifiers  usually  have  as  many  output  nodes  as  there  are  com¬ 
peting  classes.  A  trained  network  produces  a  high  output  activation  of  the  node 
corresponding  to  the  desired  class  and  low  activations  of  all  the  other  nodes.  In  this 
type  of  encoding,  the  best  guess  corresponds  to  the  output  node  with  the  highest 
activation.  Our  confidence  in  the  chosen  decision  depends  on  the  numeric  value  of 
the  highest  activation  as  well  as  on  the  difference  between  it  and  the  others,  es¬ 
pecially  the  second  highest.  Depending  on  the  desired  level  of  reliability,  a  certain 
percentage  of  classification  patterns  can  be  rejected,  and  the  obvious  strategy  is  to 
reject  patterns  with  lower  confidence  first.  The  rejection  schemes  most  commonly 
used  in  literature  are  built  around  two  common-sense  ideas:  the  confidence  increases 
with  the  value  of  the  highest  activation  and  with  the  gap  between  the  two  highest 
activations. 

In  some  systems,  only  one  of  these  values  is  used  as  the  measure  of  confidence; 
in  others,  some  ad-hoc  combination  of  both.  For  example,  Battiti  and  Colla  [1] 
reject  patterns  with  the  highest  activation  (HA)  below  a  fixed  threshold,  and  also 
those  with  the  difference  between  the  highest  and  the  second  highest  activation 
(DA) below a second fixed threshold. Martin et al. [5] use only DA in their Saccade System, whereas Gaborski [4] uses the ratio of the highest and the second
highest  activation  (RA)  as  the  preferred  measure  of  confidence.  Fogelman-Soulie 
et  al.  [3]  propose  a  distance-based  (DI)  rejection  criterion  in  addition  to  HA  and 
DA  schemes;  namely  they  compare  the  Euclidean  distance  between  the  activation 
vector  and  the  target  activation  vectors  of  all  classes  and  reject  the  pattern  if  the 
smallest  of  these  distances  is  greater  than  a  fixed  threshold.  Bromley  and  Denker 
[2]  mention  experiments  conducted  by  Yann  Le  Cun  in  support  of  their  choice  of 
the  DA  strategy. 
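The four traditional confidence measures described above (HA, DA, RA and DI) can be computed from an activation vector as in the following sketch. The function name `confidence_measures` is hypothetical, and the one-hot target vectors used in the example are an assumption; the cited systems may use other targets such as 0.9 and 0.1.

```python
import math


def confidence_measures(activations, targets):
    """Return the HA, DA, RA and DI measures for one pattern.

    activations : output activations, one per class (at least two classes)
    targets     : target activation vectors, one per class
    """
    ranked = sorted(activations, reverse=True)
    highest, second = ranked[0], ranked[1]
    return {
        "HA": highest,                               # highest activation
        "DA": highest - second,                      # gap between the two highest
        "RA": highest / second if second > 0 else math.inf,   # their ratio
        # Smallest Euclidean distance to any class target (lower is better).
        "DI": min(math.dist(activations, t) for t in targets),
    }


one_hot = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
scores = confidence_measures([0.7, 0.2, 0.1], one_hot)
```

A pattern is rejected when HA, DA or RA falls below a chosen threshold, or when DI exceeds one.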

Logically, one expects the two highest activations to provide almost all the information for the rejection decision, because usually ambiguities involve a choice between the right classification and, at most, one predominant alternative. Still, ideally, we would prefer a formula combining all the output activations into a single measure of confidence. Such a formula can be derived if we view this problem as one of integration of information from different sources in the framework of the Dempster-Shafer Theory of Evidence [6]. We can treat the activation of each output node as the evidence in favor of the corresponding classification hypothesis and combine them into the measure of confidence of the final decision.

2 The Dempster-Shafer Theory of Evidence

The  Dempster-Shafer  Theory  of  Evidence  is  a  tool  for  representing  and  combining 
measures  of  evidences.  This  theory  is  a  generalization  of  Bayesian  reasoning,  but  it 
is  more  flexible  than  Bayesian  when  our  knowledge  is  incomplete,  and,  therefore, 
we  have  to  deal  with  uncertainty.  It  allows  us  to  represent  only  actual  knowledge, 
without  forcing  us  to  make  a  conclusion  when  we  are  ignorant. 

Let $\Theta$ be a set of mutually exhaustive and exclusive atomic hypotheses $\Theta = \{\theta_1, \dots, \theta_N\}$. $\Theta$ is called the frame of discernment. Let $2^\Theta$ denote the set of all subsets of $\Theta$. A function m is called a basic probability assignment if

$$m: 2^\Theta \to [0, 1], \qquad m(\emptyset) = 0, \qquad \text{and} \qquad \sum_{A \subseteq \Theta} m(A) = 1. \qquad (1)$$

In the probability theory, a measure of probability analogous to the basic probability assignment is associated only with the atomic hypotheses $\theta_n$. The probabilities of all other subsets are derived as sums of probabilities of their atomic components, and there is nothing left to express our measure of ignorance. In the Dempster-Shafer theory, m(A) represents the belief that is committed exactly to every (not necessarily atomic) subset A. It is useful to "...think of these portions of belief as semi-mobile 'probability masses', which can sometimes move from point to point but are restricted in that various of them are [...] confined to various subsets of $\Theta$. In other words, m(A) measures the probability mass that is confined to A, but can move freely to every point of A" [6]. Now it is easier to understand why $m(\Theta)$ represents our measure of ignorance: this portion of our total belief is not committed to any proper subset of $\Theta$.

The simplest possible type of the basic probability assignment is a simple evidence function, which corresponds to the case in which all our belief is concentrated in only one subset $F \subseteq \Theta$, its focal element. If $F \ne \Theta$, then we can define m(F) = s, $m(\Theta) = 1 - s$, and m is 0 elsewhere. The degree of support s represents our belief in F, while 1 − s represents our ignorance.

The Dempster-Shafer theory provides us with the rule for the combination of evidences from two sources. This combination $m = m_1 \oplus m_2$ is called their orthogonal sum:

$$m(A) = K \sum_{X \cap Y = A} m_1(X) \cdot m_2(Y), \qquad \text{where} \qquad K^{-1} = \sum_{X \cap Y \ne \emptyset} m_1(X) \cdot m_2(Y). \qquad (2)$$

The main idea expressed in this nonintuitive formula is actually rather simple: the portion of belief committed to the intersection of two subsets X and Y should be proportional to the product of $m_1(X)$ and $m_2(Y)$ (Figure 1). Only if this intersection is empty do we have to exclude it from the sum and renormalize. Obviously, this rule can be generalized to accommodate more than two sources of information. Please refer to [6] for more detailed explanations.
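The orthogonal sum (2) can be sketched directly, representing each basic probability assignment as a dictionary from subsets to masses. The function name `combine` and the dictionary representation are choices made for this illustration.

```python
from itertools import product


def combine(m1, m2):
    """Dempster's rule: orthogonal sum of two basic probability assignments.

    Each assignment maps frozenset subsets of the frame to their masses.
    Products whose intersection is empty are dropped, and the rest is
    renormalised by K, as in eq. (2).
    """
    raw = {}
    for (x, mass_x), (y, mass_y) in product(m1.items(), m2.items()):
        intersection = x & y
        if intersection:                      # skip empty intersections
            raw[intersection] = raw.get(intersection, 0.0) + mass_x * mass_y
    total = sum(raw.values())                 # this is 1 / K
    if total == 0.0:
        raise ValueError("the two sources are in total conflict")
    return {subset: mass / total for subset, mass in raw.items()}


frame = frozenset({"a", "b"})
m1 = {frozenset({"a"}): 0.6, frame: 0.4}      # simple evidence for "a"
m2 = {frozenset({"b"}): 0.5, frame: 0.5}      # simple evidence for "b"
m = combine(m1, m2)
```

Because the result is again a basic probability assignment, more than two sources can be combined by applying the function repeatedly.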

3  Combination  of  Evidences  as  the  Measure  of  Confidence 

Consider a frame of discernment $\Theta = \{\theta_1, \dots, \theta_N\}$, where $\theta_n$ is the hypothesis that a pattern belongs to class n. The activation of the n-th output node of a neural network, $o_n$, provides information in favor of the corresponding hypothesis $\theta_n$. Our partial belief based on this information can be expressed as a simple evidence function

$$m_n(\theta_n) = \phi_n, \qquad m_n(\Theta) = 1 - \phi_n, \qquad (3)$$

where $\phi_n$ is a monotonic function of $o_n$. We shall discuss a specific form of this function later.

Figure 1 The Dempster combination rule.

Now we have to combine these evidences into their orthogonal sum according to eqn. (2). Let us start with the case in which N = 2. The first simple evidence function $m_1$, with its focal element $\theta_1$, has the degree of support $\phi_1$. Its only other nonzero value is $m_1(\Theta) = 1 - \phi_1$. Similarly, $\theta_2$ is the focal element of $m_2$, with a degree of support of $\phi_2$, and $m_2(\Theta) = 1 - \phi_2$. When we produce the orthogonal sum $m = m_1 \oplus m_2$, we have to compute beliefs for the total of four subsets of $\Theta$: $\emptyset$, $\theta_1$, $\theta_2$, and $\Theta$ (Figure 2). The value $m(\Theta)$ should be proportional to $m_1(\Theta) \cdot m_2(\Theta)$ because this is the only way to obtain $\Theta$ as the intersection. Similarly, $m(\theta_1)$ should be proportional to $m_1(\theta_1) \cdot m_2(\Theta)$, while $m(\theta_2)$ should be proportional to $m_2(\theta_2) \cdot m_1(\Theta)$. The only way we can produce $\emptyset$ as an intersection of a subset with nonzero $m_1$-evidence and another subset with nonzero $m_2$-evidence is $\emptyset = \theta_1 \cap \theta_2$. The corresponding product, $\phi_1 \cdot \phi_2$, should be excluded from the normalization constant K. Finally,

                              m_1(θ_1) = φ_1            m_1(Θ) = 1 − φ_1
m_2(Θ) = 1 − φ_2              θ_1 : φ_1 · (1 − φ_2)     Θ : (1 − φ_1) · (1 − φ_2)
m_2(θ_2) = φ_2                ∅ : φ_1 · φ_2             θ_2 : (1 − φ_1) · φ_2

Figure 2 Orthogonal sum of two simple evidence functions with atomic foci.


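$$m(\theta_1) = K \phi_1 (1 - \phi_2), \qquad m(\theta_2) = K \phi_2 (1 - \phi_1), \qquad m(\Theta) = K (1 - \phi_1)(1 - \phi_2), \qquad K^{-1} = 1 - \phi_1 \phi_2. \qquad (4)$$

It is obvious that if both $\phi_1 = 1$ and $\phi_2 = 1$, our formula cannot work because it involves division of 0 by 0. However, this is quite appropriate, because it indicates that both $m_1$ and $m_2$ are absolutely confident and flatly contradict each other. In such a case, there is no hope of reconciling the differences. If only one of the two degrees of support equals 1, for example, $\phi_1 = 1$, then $m(\theta_1) = 1$, $m(\theta_2) = 0$, and $m(\Theta) = 0$. In this case, $m_2$ cannot influence the absolutely confident $m_1$.

If neither of the degrees of support equals 1, we can denote $a_n = \phi_n / (1 - \phi_n)$ and transform eq. (4) into

$$m(\theta_1) = \frac{a_1}{1 + a_1 + a_2}, \qquad m(\theta_2) = \frac{a_2}{1 + a_1 + a_2}, \qquad m(\Theta) = \frac{1}{1 + a_1 + a_2}. \qquad (5)$$

Using mathematical induction, one can prove that when we have to combine N simple evidence functions with atomic focal elements, their orthogonal sum becomes

$$m(\theta_n) = \frac{a_n}{1 + \sum_{i=1}^{N} a_i}, \qquad m(\Theta) = \frac{1}{1 + \sum_{i=1}^{N} a_i}. \qquad (6)$$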

The combined evidence $m(\theta_n)$ represents our share of belief (or confidence) associated with each of N competing classification decisions. Our formula has a number of intuitively attractive properties:


■  the  highest  confidence  corresponds  to  the  highest  activation, 

■  the  confidence  increases  with  the  growth  of  the  corresponding  activation, 

■  the  confidence  decreases  with  the  growth  of  any  other  activation. 

Now we can rate all patterns according to the confidence of the best-guess classification
and reject them in order of increasing confidence. We will call this rejection
strategy “evidential”.
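
As a minimal sketch of formula (6), the following Python function maps a vector of output activations to the combined evidence. The function names and the particular support map `support` are assumptions made for illustration only: the seed transformation $\log_2(1 + x)$ is used directly as the degree of support, which is not necessarily the transformation finally chosen in the experiment.

```python
import math

def support(activation):
    """Hypothetical monotone map from an activation in [0, 1) to a degree of support."""
    return math.log2(1.0 + activation)

def evidential_confidences(activations):
    """Combine N simple evidence functions with atomic foci as in formula (6).

    Returns (m, m_frame): m[n] is the belief in class n, and m_frame is the
    mass that remains on the whole frame of discernment.
    """
    phis = [support(a) for a in activations]
    if any(p >= 1.0 for p in phis):
        # A degree of support equal to 1 makes alpha infinite; handle it separately.
        raise ValueError("a degree of support equal to 1 needs separate handling")
    alphas = [p / (1.0 - p) for p in phis]
    denom = 1.0 + sum(alphas)
    return [a / denom for a in alphas], 1.0 / denom
```

The confidence of the best-guess classification is then `max(m)`, and patterns would be rejected in order of increasing value of this quantity.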

4  The  Experiment 

We tested the proposed method on the handprint digit classification problem. A
trained neural network classifier was used to generate sets of activations for two
test files, each of approximately 25,000 characters. To accentuate the comparisons,
unambiguous patterns (those with the highest activation above 0.9 and all other
activations below 0.1) were excluded from the experiment. Indeed, one can argue that
any responsible rejection strategy should accept such patterns, especially because
many neural networks are trained with target activations of 0.9 and 0.1 (sometimes
even 0.8 and 0.2) instead of 1 and 0. In addition, the error vs reject curves differ
for different test sets depending on the quality of the data. The exclusion of
unambiguous patterns allows us to normalize the curves to some extent.

We found approximately 6,500 ambiguous patterns in the first set, and approximately
8,000 in the second set. The first subset was used to select the best monotonic
transformation of the activations into the support functions. Comparison of
a large number of possible functions led to the choice of $A(x) = \log_2(1 + x)$ as
the seed of a whole family of transformations $A^k$, and finally we chose $\phi = A^k$ for one fixed $k$.
The evidential (EV) rejection strategy was then used on the second subset, and its
results were compared to those achieved with the rejection schemes based on the
highest activation (HA), the difference between the highest and the second highest
activation (DA), the ratio of the highest and the second highest activation (RA),
and the distance-based rejection criterion (DI) described above. Figure 3 presents
the comparison of error vs reject graphs for all five schemes studied. To compute
data for each of the curves, all patterns are sorted according to the corresponding
confidence measure. Then the patterns are rejected one by one, and the number of
remaining misclassifications is counted. Their percentage of the total number of
remaining patterns constitutes the error rate, which is plotted against the percent
reject. Obviously, there is no difference among the strategies at zero and one hundred
percent reject.
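
The curve computation described above can be sketched as follows. This is an illustration under simple assumptions (a non-empty list of patterns, one confidence value and one correct/incorrect flag per pattern); the function name `error_reject_curve` is hypothetical.

```python
def error_reject_curve(confidences, is_correct):
    """Reject patterns one by one in order of increasing confidence.

    Returns a list of (percent_rejected, percent_error_among_remaining) pairs.
    The last pattern is never rejected, so the error rate is always defined.
    """
    order = sorted(range(len(confidences)), key=lambda i: confidences[i])
    total = len(order)
    errors_left = sum(1 for ok in is_correct if not ok)
    curve = [(0.0, 100.0 * errors_left / total)]
    for rejected, idx in enumerate(order[:-1], start=1):
        if not is_correct[idx]:
            errors_left -= 1          # a misclassified pattern has been rejected
        remaining = total - rejected
        curve.append((100.0 * rejected / total, 100.0 * errors_left / remaining))
    return curve
```

Any of the five confidence measures (HA, DA, RA, DI, EV) can be passed as `confidences`; a better measure pushes misclassified patterns towards the front of the rejection order, so its curve falls faster.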

5  Conclusions 

In this paper we demonstrated how the Dempster-Shafer Theory of Evidence can
provide the framework for estimation of the confidences of competing classification
decisions. Although we worked with activations generated by a neural network, this
approach can be applied if it is necessary to combine the outputs of any collection
of single-class recognizers.

Figure 3 Percent error vs percent rejected for different strategies.

Our formula consistently utilizes all the output activations of a neural network
classifier. Extensive experimentation allowed us to select a simple monotonic
transformation of output activations as the basic probability assignment. The testing we
conducted on the 8,000-strong set of ambiguous patterns confirmed that the evidential
rejection strategy reduces the classification error by up to 30%. It is necessary
to mention here that the best choice of the monotonic transformation for the basic
probability assignment depends on the concrete application.

Further work is needed, for we have not analyzed any information we can obtain
from the distribution of output activation values for different classes. Another possible
line of action is the analysis of the confusion matrix or, broader still, the joint
distribution of activation pairs.
Discrete dynamical systems and nonlinear response models contaminated with random noise are
considered.  The  objective  of  the  paper  is  to  propose  a  method  of  extraction  of  change-points  of  a 
mapping  (system)  from  a  neural  network  approximation.  The  method  is  based  on  the  comparison 
of  normalized  Taylor  series  coefficients  of  the  estimator  and  on  the  consistency  of  the  estimator  and 
its  derivatives.  The  proposed  algorithm  is  illustrated  on  a  simple  piecewise  polynomial  mapping. 

1  Introduction 

Consider real-valued variables $y$ as output and $x$ as input of a system. Their relation
is modeled by a general piecewise smooth function $f : [0, d] \to \mathbb{R}$ in a model

$$y = f(x) + e, \qquad (1)$$

where $e$ stands for random noise. The analysis of the function $f(x)$ will be based
on a sequence of observation data, $x_i, y_i$, generated by random variables $X_i, Y_i$,
$i = 1, \dots, n$. We assume that random variables $e_i = Y_i - f(X_i)$, representing random
noise entering the system, are mutually independent and identically distributed,
with mean zero and finite variance $\sigma^2$. A discrete dynamical system

$$y_t = g(y_{t-h}) + e_t, \qquad (2)$$

can be viewed as a system of the type (1) where we set $x_t = y_{t-h}$ and $y_t = g(x_t) + e_t$.
Thus the estimation of the generator $g$ of the equation (2) is transformed into a
regression problem of the type (1).

Throughout the paper we will be mainly concerned with models (1) and (2). However,
the proposed solution can be extended to the multivariable nonlinear function
$y_i = f(x_{i1}, x_{i2}, \dots, x_{ik}) + e_i$ and to autoregressive systems, dynamic regression and
recursive response cases, e.g. $y_t = g(y_t, y_{t-h}, \dots, y_{t-mh}) + e_t$. The present paper
considers the use of a feedforward neural network approach. That is, the spline network
(as a representative of feedforward neural nets) will be used as the tool for
estimation of the model functions.

2  Polynomial  Splines  and  Feedforward  Neural  Networks 

A neural network is a system of interconnected nonlinear units. The topology of
connections between units can be complicated. In the simplest scenario a neural
network model is no more than the sum of units. The nonlinear units are either
localized (e.g. splines and radial basis functions) or their sums form localized functions
(e.g. perceptrons). The local nature of the units has been exploited for
classification/pattern recognition problems, e.g. in [4]. We will show the usefulness of local
processing units for function change-point estimation problems.

2.1  Polynomial  Spline  Functions 

A polynomial spline function $f(x)$ is a function that consists of polynomial pieces
$f(x) = f_i(x)$ on $[\lambda_i, \lambda_{i+1}]$, $(i = 0, \dots, M)$, where $f_i$ is a polynomial of degree $k$. The
points $\lambda_0, \lambda_1, \dots, \lambda_M$ at which the function $f(x)$ changes its character are termed
break points or knots in the theory of splines (other authors prefer "joints" or "nodes").
The polynomial pieces are joined together with certain smoothness conditions. The
smoothness conditions are $\lim_{x \to \lambda_i^-} f^{(j)}(x) = \lim_{x \to \lambda_i^+} f^{(j)}(x)$, $(j = 1, \dots, k - 1)$.

In other words, a spline function of degree $k \ge 0$, having as break points the
strictly increasing sequence $\lambda_j$, $(j = 0, 1, \dots, M + 1;\ \lambda_0 = 0,\ \lambda_{M+1} = d)$, is a piecewise
polynomial on intervals $[\lambda_j, \lambda_{j+1}]$ of degree $k$ at most, with continuous derivatives
up to order $k - 1$ on $[0, d]$. It is known that a uniformly continuous function can
be approximated by polynomial splines to arbitrary accuracy, see for instance de
Boor [2]. This concept can be naturally extended to higher dimensions. The set
of spline functions is endowed with the natural structure of a finite-dimensional vector space.
The dimension of the vector space of spline functions of degree $k$ with $M$ interior
break points is $m = M + k + 1$. A neural network-like basis of the space of spline
functions is the basis of B-splines. Each spline function can be written as a unique
linear combination of basis functions. In this paper we shall use the basis of B-splines,
which may be defined, for example, in a recursive way (cf. de Boor [2]). Let us define
$\{\lambda_{-k} = \lambda_{-k+1} = \dots = \lambda_0 = 0 < \lambda_1 < \lambda_2 < \dots < \lambda_M < d = \lambda_{M+1} = \dots = \lambda_{M+k+1}\}$, an
extended set of $M + 2k + 2$ knots associated with the interval $[0, d]$. Each B-spline
basis function (a unit in neural network terminology) $B_j(x)$ is completely described
by a set of $k + 2$ knots $\{\lambda_{j-k-1}, \lambda_{j-k}, \dots, \lambda_j\}$. There are two important properties
of these units [2], namely they are nonnegative and vanish everywhere except on a
few contiguous subintervals, $B_j(x) > 0$ if $\lambda_{j-k-1} < x < \lambda_j$, $B_j(x) = 0$ otherwise,
and they define a partition of unity on $[0, d]$, $\sum_j B_j(x) = 1$ for all $x \in [0, d]$. We
shall mainly deal with cubic $(k = 3)$ and quadratic $(k = 2)$ B-splines because
of their simplicity and importance in applications. Higher-degree splines are used
whenever more smoothness is needed in the approximating function. For a chosen
number $m$ of units, the resulting polynomial spline is a function $f_m$ from the set
$S_m^k = \{f_m : [0, d] \to \mathbb{R} \mid f_m(x) = \sum_{j=1}^{m} w_j B_j^k(x)\}$. By $S^k = \bigcup_m S_m^k$ denote the
set of all univariate splines of degree $k$ (order $k + 1$). All these properties allow us to
think of spline functions as neural networks. The numbers $w_j$ are called weights (in
statistics and neural network theory) or control points (in computer aided geometric
design).
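
The recursive definition mentioned above can be sketched in a few lines of Python. This is an illustration using the standard Cox-de Boor recursion, not code from the paper; the function names are hypothetical, indices are 0-based (the text counts units from 1), and the sketch evaluates the basis only for $x$ in the half-open interval $[0, d)$.

```python
def extended_knots(interior, d, k):
    """k+1 copies of 0, the interior break points, then k+1 copies of d."""
    return [0.0] * (k + 1) + [float(t) for t in interior] + [float(d)] * (k + 1)

def bspline_basis(x, knots, k):
    """Values of all m = M + k + 1 degree-k B-spline units at a point x in [0, d)."""
    # Degree 0: indicator functions of the half-open knot intervals.
    b = [1.0 if knots[j] <= x < knots[j + 1] else 0.0
         for j in range(len(knots) - 1)]
    for deg in range(1, k + 1):
        nxt = []
        for j in range(len(knots) - 1 - deg):
            left_den = knots[j + deg] - knots[j]
            right_den = knots[j + deg + 1] - knots[j + 1]
            value = 0.0
            if left_den > 0:       # convention: a term with zero denominator counts as 0
                value += (x - knots[j]) / left_den * b[j]
            if right_den > 0:
                value += (knots[j + deg + 1] - x) / right_den * b[j + 1]
            nxt.append(value)
        b = nxt
    return b
```

For example, with `knots = extended_knots([0.25, 0.5, 0.75], 1.0, 2)` the call `bspline_basis(0.1, knots, 2)` returns six nonnegative numbers that sum to one, in line with the two properties quoted from [2].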

2.2  Feedforward  Neural  Networks 

The virtue of neural networks in our approach is their capability to capture more
information about the true system structure than, e.g., polynomials. There are
a number of methods for estimation of the network parameters. For example,
incremental learning algorithms are described in Fahlman and Lebiere [4], and Jones's
approach in [5]. We will assume that neural network parameters have been estimated
by one of the available learning algorithms. In the case of B-spline networks, the
simplest approach computes the weights by the least squares method and, if necessary,
iterates the placement of knots in order to optimize the fit. Neural networks do not
provide immediate insight into how the modeled mechanism works. This insight
can be gained by solving an extraction problem. What we mean by an extraction
problem for a neural net is how to extract the underlying structure of dependence
in terms of a local polynomial regression model. This framework is useful when a
simple explicit description (e.g. in terms of low-degree polynomial regression) is
required. This technique is also more feasible than a 'direct' nonlinear polynomial
regression for systems where the generator $g$ is changing in time.

3  Extraction  of  a  Piecewise  Polynomial  Model 

Why should we use nets instead of polynomials? It is well known that the set of
all polynomials is dense in the space of all continuous functions (Stone-Weierstrass
theorem). However, for practical purposes approximation by a single polynomial
may create polynomials of prohibitively high degree. These high-degree polynomials
most likely do not reflect the nature of the underlying response function. A neural
network may perform approximation tasks more efficiently. Important questions
include how to extract the structure of the function $f$. When speaking about
extraction of the underlying polynomial structure we envision a target function to
be a polynomial or a piecewise polynomial. These functions admit a local approximation
by a lower-degree polynomial. This local approximation can be arbitrarily
accurate provided that the interval over which we approximate is sufficiently small.
This is true even across the break points of a piecewise polynomial. Polynomial
extraction and diagnosis of the point of change are based on the consistency of the
approximation and its derivatives, i.e. on their convergence to the target function
and its derivatives.

It is well known that a smooth function $f$ and its derivatives can be approximated
to arbitrary accuracy by splines. This stems for example from the error bounds for
polynomial splines and their derivatives established by Hall and Meyer [3], Marsden
[6]. For cubic spline interpolants the error bound is

$$\|(f - f_m)^{(r)}\|_\infty \le C_r \|f^{(4)}\|_\infty h^{4-r}, \quad (r = 0, 1, 2, 3) \qquad (3)$$

where $h$ is the mesh gauge and $C_r$ are constants. For noisy data the consistency
result needs to be extended to the statistical estimators as well.

If we assume that the response function $f$ is smooth and has bounded derivatives
then it can be locally approximated to arbitrary accuracy by the Taylor polynomial
of degree $k$. The Lagrange remainder term of the Taylor expansion at the point $a$
has the form

$$R_k = \frac{(x - a)^{k+1}}{(k+1)!} f^{(k+1)}\big(a + \theta (x - a)\big), \quad 0 < \theta < 1 \qquad (4)$$

This is particularly true for $k = 2, 3$ when the mesh gauge $h = \max_i |\lambda_{i+1} - \lambda_i|$, $i = 0, \dots, n - 1$,
is sufficiently small. It follows from the equations (3), (4) that good
approximation of the smooth response function $f$ implies good approximation of the
coefficients of the local Taylor expansion of $f$. It means that one polynomial piece
will be approximated by a spline having polynomial pieces with the same coefficients
at a common point. If $f$ is not smooth, i.e. it has a finite set of active break points,
then good approximation by splines is still possible except for arbitrarily small
neighborhoods around break points. In the neighborhood of a break point of $f$
the polynomial piece of the spline approximation will reflect the abrupt change of the
function $f$. This approach can be generalized for noisy data. We will utilize the
results of Stone [8] showing that a sequence of the optimal estimates $f_{mn} \in S_m^k$ of
$f$ from a noisy data set $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$ converges to $f$ as $m$ and $n$ go to
infinity, and that the same holds for the derivatives $f_{mn}^{(l)}$, for $l$ not greater than
the degree of the estimating splines, provided $f^{(l)}$ is Lipschitz continuous.
4  Tests  for  Functional  Change

A gradually changing response function will produce gradually changing normalized
polynomial coefficients. By normalized coefficients we mean coefficients of polynomial
pieces expanded about a common point. An abrupt change in the response function
will be reflected by an abrupt change in the normalized coefficients. To test
for the points of change in the behavior of the neural network approximation $f_{mn}$,
the polynomial pieces, with coefficients $p_{ij}(\lambda_j)$, were recalculated about a common
point $x = 0$. A set of new coefficients $p_{ij}(0, \lambda_j)$ was obtained and we call those
coefficients normalized Taylor coefficients. The index $i$ refers to the degree of the
piece $p_{ij}$ and $j$ to the $j$th interval. The points with the same normalized coefficients
indicate no change of the functional dependence.

For stationary functions/systems all normalized coefficients will be constant or almost
constant (due to noise and approximation errors).
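
For a quadratic piece the recalculation about the common point is elementary: $a(x - \lambda)^2 + b(x - \lambda) + c = a x^2 + (b - 2a\lambda) x + (c - b\lambda + a\lambda^2)$. The sketch below applies this identity and flags break points where the normalized coefficients jump. The function names and the simple threshold rule `tol` are assumptions made for illustration; the text does not prescribe a specific test statistic.

```python
def normalize_quadratic(a, b, c, lam):
    """Re-expand a*(x-lam)**2 + b*(x-lam) + c about the common point x = 0."""
    return (a, b - 2.0 * a * lam, c - b * lam + a * lam * lam)

def change_points(pieces, tol):
    """pieces: list of (a, b, c, lam), one per interval, ordered by lam.

    Returns the break points lam at which some normalized coefficient differs
    from that of the previous piece by more than tol (a hypothetical rule).
    """
    flagged = []
    previous = None
    for a, b, c, lam in pieces:
        current = normalize_quadratic(a, b, c, lam)
        if previous is not None:
            jump = max(abs(u - v) for u, v in zip(current, previous))
            if jump > tol:
                flagged.append(lam)
        previous = current
    return flagged
```

With noisy estimates, `tol` would have to be chosen relative to the estimation error of the coefficients, which is larger for the higher-order terms.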

5  Large  Sample  Properties 

To establish the consistency of the estimator $f_{mn}$ and its derivatives we recall
the results of Stone [8], who deals with optimal rates of convergence of response
function estimators based on the spline functions $f_{mn}(x) = \sum_{j=1}^{m} w_j B_j^k(x)$. This
representation is a very useful one because it casts the spline into a multiple linear
regression context. If the set of knots is fixed, the estimator may be found as the
least squares solution in parameters $w_j$, $j = 1, \dots, m$. Then the vector of weights is
$w = (Z^T Z)^{-1} Z^T Y$, where $Y = (y_1, \dots, y_n)^T$ and $Z$ is an $(n, m)$ matrix such that
$Z_{ij} = B_j^k(x_i)$. Let us denote by $C_p$ the set of all functions $f : [0, d] \to \mathbb{R}$ having
continuous $p$th derivatives, by $\|\cdot\|$ the $L_2$ norm on $[0, d]$, and by $E(\cdot)$ the mean operator.
For given sequences of numbers $a_n, b_n$, $b_n > 0$, by $a_n = O(b_n)$ we mean that the
sequence $a_n / b_n$ is bounded. When $A_n$ are random variables, $\operatorname{Plim}_{n \to \infty} A_n = 0$
means that for each $\epsilon > 0$, $\lim_{n \to \infty} P(|A_n| < \epsilon) = 1$. For simplicity we assume that
data points $x_1, \dots, x_n$ and interior knots (break points) are uniformly distributed
in the interval $[0, d]$. To assure good approximation of the response function we
assume that we sampled more data points than the number of B-spline units, i.e.
$m = O(n^\gamma)$, $\gamma \in (0, 1)$.
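
A minimal sketch of the least squares step $w = (Z^T Z)^{-1} Z^T Y$ follows. It solves the normal equations by Gaussian elimination in plain Python so that it stays self-contained; the function name is hypothetical, and in practice a QR-based solver from a numerical library would be the more stable choice.

```python
def least_squares_weights(Z, Y):
    """Solve the normal equations (Z^T Z) w = Z^T Y by Gaussian elimination.

    Z is an n-by-m design matrix given as a list of rows (Z[i][j] would be the
    value of unit j at data point i), and Y is the list of n responses.
    """
    n, m = len(Z), len(Z[0])
    # Augmented matrix [Z^T Z | Z^T Y].
    A = [[sum(Z[r][i] * Z[r][j] for r in range(n)) for j in range(m)]
         + [sum(Z[r][i] * Y[r] for r in range(n))] for i in range(m)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("Z^T Z is numerically singular")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, m):
            factor = A[r][col] / A[col][col]
            for c in range(col, m + 1):
                A[r][c] -= factor * A[col][c]
    # Back substitution.
    w = [0.0] * m
    for i in reversed(range(m)):
        tail = sum(A[i][j] * w[j] for j in range(i + 1, m))
        w[i] = (A[i][m] - tail) / A[i][i]
    return w
```

A singular $Z^T Z$ typically means that some unit has no data points inside its support, which is one reason for requiring more data points than units.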

Proposition 1 Suppose $f$ is from $C_1$. Then $E\|f_{mn} - f\|^2 = O(n^{\dots}) + \dots$
(this follows from the Lipschitz continuity of $f$). Suppose $f$ is from $C_p$, $p > 0$, and $k$
is the degree of the estimating spline. Let $\gamma = \dots$. Then, for $l = 0, 1, \dots, \min(k, p - 1)$,
it holds that $E\|f_{mn}^{(l)} - f^{(l)}\|^2 = O(n^{-2 r_l})$, where $r_l = \dots$

The consistency of the derivatives supports diagnostics based on the Taylor
expansion, especially for an analysis of the low-order derivatives. The procedure of
model retrieval is an iterative process combining techniques of estimation with
methods of testing, i.e. of comparing the model with the actual behavior of the
system.

6  Numerical  Example  of  Estimation  of  1-D  Discrete  Piecewise 
Quadratic  Dynamical  System 

The algorithm has been applied to a time series from a satellite data set, see reference
[7]. Here, for the numerical illustration, let us consider the following simple example.
A discrete dynamical system is defined by two quadratic maps: $f_n = 4 f_{n-1} - 4 f_{n-1}^2$
for $f_{n-1} \le 0.5$, and $f_n = 1.2 + 0.4 f_{n-1} - 1.6 f_{n-1}^2$ for $f_{n-1} > 0.5$. This map
is from the family of quadratic maps that is known to be chaotic and ergodic.
The time series generated by this map cannot be predicted for many steps ahead
but the generator of this series can be identified from the time series. We used
a quadratic spline $f_{mn}$ with 40 interior break points to identify sets of polynomial
coefficients. Pieces of the spline $f_{mn}$ are represented by local polynomials
$a_j (x - \lambda_j)^2 + b_j (x - \lambda_j) + c_j$, $(j = 0, 1, \dots, 39)$. Figures 1, 2 and 3 show polynomial
coefficients $a_j(0)$, $b_j(0)$, $c_j(0)$. These coefficients were recalculated from the
local coefficients at the common break point $\lambda_0 = 0$. Figure 4 shows the piecewise
quadratic map. Figure 5 shows the error of the spline fit. Figure 6 shows 400 points
of the chaotic time series generated by the piecewise quadratic map and additive
Gaussian noise $N(0, \sigma^2 = 0.0001)$. The dynamical problem then becomes the following
approximation test: a domain is randomly partitioned into a finite number
of subintervals. On each subinterval, data are generated by a noisy polynomial. The
aim of the analysis is to find those polynomials and/or to find all break points.

Figure 3 Qdr. Coeff. at x = 0; $a_1 = -4$, $a_2 = -1.6$.

Figure 4 y(n) vs. y(n - 1) for n = 1 to 800.

Figure 5 Spline fit error.

Figure 6 Quadratic Map time series.
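
A small sketch of how such a test series could be generated is given below. It assumes the quadratic coefficient of the second branch is -1.6, as stated in the caption of Figure 3 (this also makes the two branches meet at 0.5). The clipping of the state to $[0, 1]$, the starting value and the function names are assumptions of this sketch, added so that the noisy orbit cannot leave the interval on which the map is defined.

```python
import random

def g(x):
    """Piecewise quadratic generator of the example."""
    if x <= 0.5:
        return 4.0 * x - 4.0 * x * x
    return 1.2 + 0.4 * x - 1.6 * x * x

def simulate(n_steps, y0=0.3, sigma=0.01, seed=0):
    """Iterate y_t = g(y_{t-1}) + e_t with e_t ~ N(0, sigma**2).

    sigma = 0.01 corresponds to the noise variance 0.0001 used in the text.
    Clipping to [0, 1] is an assumption of this sketch, not part of the model.
    """
    rng = random.Random(seed)
    series = [y0]
    for _ in range(n_steps):
        y = g(series[-1]) + rng.gauss(0.0, sigma)
        series.append(min(1.0, max(0.0, y)))
    return series
```

The pairs `(series[t-1], series[t])` then play the role of the regression data $(x_t, y_t)$ to which the quadratic spline is fitted.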
7  Concluding  Remarks

We have proposed a method employing the B-spline network for estimation of an
unknown response function from noisy data, and for extraction of a potential piecewise
polynomial submodel. It has been shown how the first terms of the Taylor expansion
of the estimator, when expanded at different points of its domain, can be utilized
for revealing the degree of the underlying polynomial model, and for the detection
of changes in the model. This approach can be adapted for multidimensional and
discrete dynamical systems (3) or (4). In these cases, we have to work with the
partial derivatives. However, the multivariate case leads to the use of multivariate
approximating functions. In such a situation, the estimation becomes less effective
(a much greater amount of data is needed), and even the theoretical rate of convergence
is slower. Many authors (cf. Breiman [1], Stone [8]) recommend the use of additive
approximations with low-dimensional components.
In supervised learning, the redundancy contained in random examples can be avoided by learning
from queries, where training examples are chosen to be maximally informative. Using the tools of
statistical mechanics, we analyse query learning in a simple multi-layer network, namely, a large
tree-committee machine. The generalization error is found to decrease exponentially with the
number of training examples, providing a significant improvement over the slow algebraic decay
for random examples. Implications for the connection between information gain and generalization
error in multi-layer networks are discussed, and a computationally cheap algorithm for constructing
approximate maximum information gain queries is suggested and analysed.

1  Introduction 

In supervised learning of input-output mappings, the traditional approach has been
to study generalization from random examples. However, random examples contain
redundant information, and generalization performance can thus be improved by
query learning, where each new training input is selected on the basis of the existing
training data to be most ‘useful’ in some specified sense. Query learning corresponds
closely to the well-founded statistical technique of (sequential) optimal experimental
design. In particular, we consider in this paper queries which maximize the expected
information gain, which are related to the criterion of (Bayes) D-optimality in optimal
experimental design. The generalization performance achieved by maximum
information gain queries is by now well understood for single-layer neural networks
such as linear and binary perceptrons [1, 2, 3]. For multi-layer networks, which are
much more widely used in practical applications, several heuristic algorithms for
query learning have been proposed (see e.g., [4, 5]). While such heuristic approaches
can demonstrate the power of query learning, they are hard to generalize to situations
other than the ones for which they have been designed, and they cannot
easily be compared with more traditional optimal experimental design methods.
Furthermore, the existing analyses of such algorithms have been carried out within
the framework of ‘probably approximately correct’ (PAC) learning, yielding worst
case results which are not necessarily close to the potentially more relevant average
case results. In this paper we therefore analyse the average generalization performance
achieved by query learning in a multi-layer network, using the powerful tools
of statistical mechanics. This is the first quantitative analysis of its kind that we
are aware of.

2  The  Model 

We focus our analysis on one of the simplest multi-layer networks, namely, the
tree-committee machine (TCM). A TCM is a two-layer neural network with $N$ input
units, $K$ hidden units and one output unit. The ‘receptive fields’ of the individual
hidden units do not overlap, and each hidden unit calculates the sign of a linear
combination (with real coefficients) of the $N/K$ input components to which it is
connected. The output unit then calculates the sign of the sum of all the hidden
unit outputs. A TCM therefore effectively has all the weights from the hidden to
the output layer fixed to one. Formally, the output $y$ for a given input vector $\mathbf{x}$ is

$$y = \mathrm{sgn}\Big(\sum_{i=1}^{K} \sigma_i\Big), \qquad \sigma_i = \mathrm{sgn}(\mathbf{x}_i^T \mathbf{w}_i) \qquad (1)$$

where the $\sigma_i$ are the outputs of the hidden units, $\mathbf{w}_i$ their weight vectors, and
$\mathbf{x}^T = (\mathbf{x}_1^T, \dots, \mathbf{x}_K^T)$ with $\mathbf{x}_i$ containing the $N/K$ (real-valued) inputs to which hidden
unit $i$ is connected. The $N$ (real) components of the $K$ $(N/K)$-dimensional
hidden unit weight vectors $\mathbf{w}_i$, which we denote collectively by $\mathbf{w}$, form the adjustable
parameters of a TCM. Without loss of generality, we assume the weight
vectors to be normalized to $\mathbf{w}_i^2 = N/K$. We shall restrict our analysis to the case
where both the input space dimension and the number of hidden units are large
($N \to \infty$, $K \to \infty$), assuming that each hidden unit is connected to a large number
of inputs, i.e., $N/K \gg 1$. As our training algorithm we take (zero temperature)
Gibbs learning, which generates at random any TCM (in the following referred to
as a ‘student’) which predicts all the training outputs in a given set of $p$ training
examples $\Theta^{(p)} = \{(\mathbf{x}^\mu, y^\mu),\ \mu = 1 \dots p\}$ correctly. We take the problem to be perfectly
learnable, which means that the outputs $y^\mu$ corresponding to the inputs $\mathbf{x}^\mu$
are generated by a ‘teacher’ TCM with the same architecture as the student but
with different, unknown weights $\mathbf{w}^0$. It is further assumed that there is no noise on
the training examples. For learning from random examples, the training inputs $\mathbf{x}^\mu$
are sampled randomly from a distribution $P_0(\mathbf{x})$. Since the output (1) of a TCM
is independent of the length of the hidden unit input vectors $\mathbf{x}_i$, we assume this
distribution $P_0(\mathbf{x})$ to be uniform over all vectors $\mathbf{x}^T = (\mathbf{x}_1^T, \dots, \mathbf{x}_K^T)$ which obey
the spherical constraints $\mathbf{x}_i^2 = N/K$. For query learning, the training inputs $\mathbf{x}^\mu$ are
chosen to maximize the expected information gain of the student, as follows. The
information gain is defined as the decrease in the entropy $S$ in the parameter space
of the student. The entropy for a training set $\Theta^{(p)}$ is given by

$$S(\Theta^{(p)}) = -\int d\mathbf{w}\, P(\mathbf{w}|\Theta^{(p)}) \ln P(\mathbf{w}|\Theta^{(p)}). \qquad (2)$$

For the Gibbs learning algorithm considered here, $P(\mathbf{w}|\Theta^{(p)})$ is uniform on the
‘version space’, the space of all students which predict all training outputs correctly
(and which satisfy the assumed spherical constraints on the weight vectors,
$\mathbf{w}_i^2 = N/K$), and zero otherwise. Denoting the version space volume by $V(\Theta^{(p)})$,
the entropy can thus simply be written as $S(\Theta^{(p)}) = \ln V(\Theta^{(p)})$. When a new
training example $(\mathbf{x}^{p+1}, y^{p+1})$ is added to the existing training set, the information
gain is $I = S(\Theta^{(p)}) - S(\Theta^{(p+1)})$. Since the new training output $y^{p+1}$ is unknown,
only the expected information gain, obtained by averaging over $y^{p+1}$, is available
for selecting a maximally informative query $\mathbf{x}^{p+1}$. As derived in Ref. [2], the probability
distribution of $y^{p+1}$ given the input $\mathbf{x}^{p+1}$ and the existing training set $\Theta^{(p)}$
is $P(y^{p+1} = \pm 1 | \mathbf{x}^{p+1}, \Theta^{(p)}) = v^{\pm}$, where $v^{\pm} = V(\Theta^{(p+1)})|_{y^{p+1} = \pm 1} / V(\Theta^{(p)})$. The
expected information gain is therefore

$$\langle I \rangle_{P(y^{p+1}|\mathbf{x}^{p+1}, \Theta^{(p)})} = -v^{+} \ln v^{+} - v^{-} \ln v^{-} \qquad (3)$$

and attains its maximum value $\ln 2$ (= 1 bit) when $v^{\pm} = \frac{1}{2}$, i.e., when the new
input $\mathbf{x}^{p+1}$ bisects the existing version space. This is intuitively reasonable, since
$v^{\pm} = \frac{1}{2}$ corresponds to maximum uncertainty about the new output and hence to
maximum information gain once this output is known.
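
As a quick numerical check of (3), the expected information gain can be evaluated as a function of $v^{+}$ alone, since $v^{-} = 1 - v^{+}$. The function name below is hypothetical; natural logarithms are used, so the maximum is $\ln 2$.

```python
import math

def expected_information_gain(v_plus):
    """<I> = -v+ ln v+ - v- ln v-, with v- = 1 - v+ (natural logarithm)."""
    total = 0.0
    for v in (v_plus, 1.0 - v_plus):
        if v > 0.0:                 # v ln v tends to 0 as v tends to 0
            total -= v * math.log(v)
    return total
```

The value falls off symmetrically on both sides of $v^{+} = \frac{1}{2}$, so an input that splits the version space unevenly yields less information on average than one that bisects it.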
Due to the complex geometry of the version space, the generation of queries which
achieve exact bisection is in general computationally infeasible. The ‘query by
committee’ algorithm proposed in Ref. [2] provides a solution to this problem by first
sampling a ‘committee’ of $2k$ students from the Gibbs distribution $P(\mathbf{w}|\Theta^{(p)})$ and
then using the fraction of committee members which predict +1 or -1 for the output
$y$ corresponding to an input $\mathbf{x}$ as an approximation to the true probability
$P(y = \pm 1 | \mathbf{x}, \Theta^{(p)}) = v^{\pm}$. The condition $v^{\pm} = \frac{1}{2}$ is then approximated by the
requirement that exactly $k$ of the committee members predict +1 and -1, respectively.
An approximate maximum information gain query can thus be found by
sampling (or filtering) inputs from a stream of random inputs until this condition is
met. The procedure is then repeated for each new query. As $k \to \infty$, this algorithm
approaches the exact bisection algorithm, and it is on this limit that we focus in
the following.
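
The filtering step can be sketched as follows for a small TCM. All names are hypothetical, and the sketch takes the committee as given: sampling its members from the Gibbs distribution over the version space is the hard part of the algorithm and is not attempted here. Ties in the sign function are resolved to +1 as a convention of this sketch.

```python
import random

def sign(z):
    return 1 if z >= 0 else -1

def tcm_output(weights, x):
    """Output (1) of a TCM; weights and x are lists of K blocks of N/K numbers."""
    hidden = [sign(sum(wi * xi for wi, xi in zip(w_block, x_block)))
              for w_block, x_block in zip(weights, x)]
    return sign(sum(hidden))

def random_input(K, block_size, rng):
    """Random input whose K blocks each have squared length N/K = block_size."""
    x = []
    for _ in range(K):
        block = [rng.gauss(0.0, 1.0) for _ in range(block_size)]
        norm = sum(v * v for v in block) ** 0.5
        x.append([v * block_size ** 0.5 / norm for v in block])
    return x

def query_by_committee(committee, K, block_size, rng, max_tries=10000):
    """Filter random inputs until the 2k committee members split k against k."""
    half = len(committee) // 2
    for _ in range(max_tries):
        x = random_input(K, block_size, rng)
        votes = sum(1 for w in committee if tcm_output(w, x) == 1)
        if votes == half:
            return x
    return None   # no bisecting input found within max_tries
```

The number of random inputs that must be filtered before an even split occurs grows as the committee members come to agree on most inputs, which is the cost that a constructive query selection scheme aims to avoid.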

3  Exact  Maximum  Information  Gain  Queries 

The main quantity of interest in our analysis is the generalization error $\epsilon_g$,
defined as the probability that a given student TCM will predict the output of the
teacher incorrectly for a random test input sampled from $P_0(\mathbf{x})$. It can be
expressed in terms of the overlaps $R_i$ of the student and teacher hidden unit weight
vectors $\mathbf{w}_i$ and $\mathbf{w}_i^0$ [6]. In the thermodynamic limit, the $R_i$
are self-averaging, and can be obtained from a replica calculation of the average
entropy $s$ as a function of the normalized number of training examples,
$\alpha = p/N$; details will be reported in a forthcoming publication [7]. The resulting
average generalization error is plotted in Figure 1; for large $\alpha$, one can show
analytically that $\epsilon_g \propto \exp(-\frac{\alpha}{2}\ln 2)$. This exponential
decay of the generalization error $\epsilon_g$ with $\alpha$ provides a marked
improvement over the $\epsilon_g \propto 1/\alpha$ decay achieved by random examples
[6]. The effect of maximum information gain queries is thus similar to what is observed
for a binary perceptron learning from a binary perceptron teacher, but the decay
constant $c$ in $\epsilon_g \propto \exp(-c\alpha)$ is only half of that for the binary
perceptron [2]. This means that asymptotically, twice as many examples are needed for a
TCM as for a binary perceptron to achieve the same generalization performance, in
agreement with the results for random examples [6]. Since maximum information gain
queries lead to an entropy $s = -\alpha \ln 2$ in both networks, we can also conclude
that the relation $s \approx \ln \epsilon_g$ for the binary perceptron [2] has to be
replaced by $s \approx 2 \ln \epsilon_g$ for the tree committee machine. Figure 1 shows
that, as expected, this relation holds independently of whether one is learning from
queries or from random examples.

Figure 1 Left: Generalization error $\epsilon_g$ vs. (normalized) number of examples
$\alpha$, for exact maximum information gain queries (Section 3), queries selected by
constructive algorithm (Section 4), and random examples. Right: $\ln \epsilon_g$ vs.
(normalized) entropy $s$. For both queries and random examples,
$\ln \epsilon_g \approx \frac{1}{2}s$ (thin full line) for large negative values of $s$
(corresponding to large $\alpha$).
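
The factor of two between the two networks can be checked with a few lines of arithmetic. The sketch below assumes a pure exponential law with prefactor 1 (the text only gives proportionality), and the function name `alpha_needed` is hypothetical.

```python
import math

def alpha_needed(eps_target, decay_constant):
    """alpha at which exp(-c * alpha) has fallen to eps_target (prefactor assumed 1)."""
    return math.log(1.0 / eps_target) / decay_constant

# Decay constants implied by s = -alpha ln 2 together with the entropy relations
# s ~ ln(eps_g) for the binary perceptron and s ~ 2 ln(eps_g) for the TCM.
C_PERCEPTRON = math.log(2.0)
C_TCM = 0.5 * math.log(2.0)

ratio = alpha_needed(0.01, C_TCM) / alpha_needed(0.01, C_PERCEPTRON)
```

Under these assumptions `ratio` equals 2 for any target error, which is the "twice as many examples" statement in the text.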

4  Constructive  Query  Selection  Algorithm 

We now consider the practical realization of maximum information gain queries in
the TCM. The query by committee approach, which in the limit $k \to \infty$ is an exact
algorithm for selecting maximum information gain queries, filters queries from a stream
of random inputs. This leads to an exponential increase of the query filtering time
with the number of training examples that have already been learned [3]. As a cheap
alternative we propose a simple algorithm for constructing queries, which is based
on the assumption of an approximate decoupling of the entropies of the different
hidden units, as follows. Each individual hidden unit of a TCM can be viewed
as a binary perceptron. The distribution $P(\mathbf{w}_i|\Theta^{(p)})$ of its weight
vector $\mathbf{w}_i$ given a set of training examples $\Theta^{(p)}$ has an entropy
$S_i$ associated with it, in analogy to the entropy (2) of the full weight distribution
$P(\mathbf{w}|\Theta^{(p)})$. Our ‘constructive algorithm’ for selecting queries then
consists in choosing, for each new query $\mathbf{x}$, the inputs $\mathbf{x}_i$ to the
individual hidden units in such a way as to maximize the decrease in their entropies
$S_i$. This can be achieved simply by choosing each $\mathbf{x}_i$ to be orthogonal to
$\bar{\mathbf{w}}_i = \langle \mathbf{w}_i \rangle$, the average of $\mathbf{w}_i$ over
$P(\mathbf{w}|\Theta^{(p)})$ (and otherwise random, i.e., according to
$P_0(\mathbf{x})$) [7], thus avoiding the cumbersome and time-consuming filtering from
a random input stream. In practice, one would of course approximate
$\bar{\mathbf{w}}_i$ by an average of $2k$ (say) samples from the Gibbs distribution
$P(\mathbf{w}|\Theta^{(p)})$; these samples would have been needed anyway in the query
by committee approach.
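
A minimal sketch of the constructive step is given below. It assumes that the weight samples for each hidden unit are already available, that the input distribution is isotropic Gaussian, and that "orthogonal and otherwise random" can be realised by projecting out the component along the average weight vector. All function names are hypothetical.

```python
import random

def mean_vector(samples):
    """Component-wise average of a list of equally long weight vectors."""
    n = len(samples)
    return [sum(col) / n for col in zip(*samples)]

def orthogonal_random_input(w_bar, rng):
    """Gaussian random vector with its component along w_bar projected out."""
    x = [rng.gauss(0.0, 1.0) for _ in w_bar]
    norm_sq = sum(w * w for w in w_bar)
    if norm_sq == 0.0:
        return x  # no preferred direction yet, so any random input will do
    coeff = sum(xi * wi for xi, wi in zip(x, w_bar)) / norm_sq
    return [xi - coeff * wi for xi, wi in zip(x, w_bar)]

def constructive_query(samples_per_unit, rng):
    """Concatenate one orthogonal input block per hidden unit of the TCM."""
    query = []
    for samples in samples_per_unit:
        query.extend(orthogonal_random_input(mean_vector(samples), rng))
    return query
```

Each query costs one average and one projection per hidden unit, with no rejection loop, which is the source of the saving over filtering.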

The generalization performance achieved by this constructive algorithm can again
be calculated by the replica method; as shown in Figure 1, it is actually slightly
superior to that of exact maximum information gain queries. The $\alpha$-dependence of
the entropy, $s = -\alpha \ln 2$, turns out to be the same as for maximum information
gain queries; this indicates that the correlations between the individual hidden
units become sufficiently small for $K \to \infty$, so that queries selected to minimize
the individual hidden units’ entropies also minimize the overall entropy of the TCM.

5  Conclusions 

We have analysed query learning for maximum information gain in a large
tree-committee machine (TCM). Our main result is the exponential decay of the
generalization error $\epsilon_g$ with the normalized number of training examples
$\alpha$, which demonstrates that query learning can yield significant improvements over
learning from random examples (for which $\epsilon_g \propto 1/\alpha$ for large
$\alpha$) in multi-layer neural networks. The fact that the decay constant $c$ in
$\epsilon_g \propto \exp(-c\alpha)$ differs from that calculated for single-layer nets
such as the binary perceptron raises the question of how large $c$ would be in more
complex multi-layer networks. Combining the worst-case bound in [3] in terms of the
VC-dimension with existing storage capacity bounds, one would estimate that $c$ could be
as small as $O(1/\ln K)$ for networks with a large number of hidden units $K$. This
contrasts with our result $c \to$ const. for $K \to \infty$, and further work is clearly
needed to establish whether there are realistic networks which saturate the lower bound
$c = O(1/\ln K)$.
We  have  also  analysed  a  computationally  cheap  algorithm  for  constructing  (rather
than  filtering)  approximate  maximum  information  gain  queries,  and  found  that  it 
actually  achieves  slightly  better  generalization  performance  than  exact  maximum 
information  gain  queries.  This  result  is  particularly  encouraging  considering  the 
practical  application  of  query  learning  in  more  complex  multi-layer  networks.  For 
example,  the  proposed  constructive  algorithm  can  be  modified  for  query  learning 
in  a  fully-connected  committee  machine  (where  each  hidden  unit  is  connected  to  all 
the  inputs),  by  simply  choosing  each  new  query  to  be  orthogonal  to  the  subspace 
spanned  by  the  average  weight  vectors  of  all  K  hidden  units.  As  long  as  K  is 
much  smaller  than  the  input  dimension  N,  and  assuming  that  for  large  enough 
K  the  approximate  decoupling  of  the  hidden  unit  entropies  still  holds  for  fully 
connected  networks,  one  would  expect  this  algorithm  to  yield  a  good  approximation 
to  maximum  information  gain  queries.  The  same  conclusion  may  also  hold  for  a 
general  two-layer  network  with  threshold  units  (where,  in  contrast  to  the  committee
machine,  the  hidden-to-output  weights  are  free  parameters),  which  can  approximate 
a  large  class  of  input-output  mappings.  In  summary,  our  results  therefore  suggest 
that  the  drastic  improvements  in  generalization  performance  achieved  by  maximum 
information  gain  queries  can  be  made  available,  in  a  computationally  cheap  manner, 
for  realistic  neural  network  learning  problems. 
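
The modification for a fully-connected committee machine can be sketched in the same spirit: a random input is made orthogonal to the span of the $K$ average weight vectors. The Gram-Schmidt projection used here is one possible way to do this and is not taken from the paper; the Gaussian input distribution and all names are assumptions.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt: orthonormal basis of the span of the given vectors."""
    basis = []
    for v in vectors:
        r = list(v)
        for b in basis:
            c = dot(r, b)
            r = [ri - c * bi for ri, bi in zip(r, b)]
        norm = dot(r, r) ** 0.5
        if norm > tol:  # skip vectors that are (numerically) linearly dependent
            basis.append([ri / norm for ri in r])
    return basis

def query_orthogonal_to_span(mean_weights, rng):
    """Random N-dimensional input orthogonal to every average weight vector."""
    n = len(mean_weights[0])
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for b in orthonormal_basis(mean_weights):
        c = dot(x, b)
        x = [xi - c * bi for xi, bi in zip(x, b)]
    return x
```

The projection removes at most $K$ of the $N$ dimensions, so for $K$ much smaller than $N$ the query remains close to a typical random input.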
A technique for obtaining shift, rotation and scale invariant signatures for two
dimensional contours is proposed and demonstrated. An invariance factor is calculated at
each point by comparing the orientation of the tangent vector with vector fields
corresponding to the generators of Lie transformation groups for shift, rotation and
scaling. The statistics of these invariance factors over the contour are used to produce
an invariance signature. This operation is implemented in a Model-Based Neural Network
(MBNN), in which the architecture and weights are parameterised by the constraints of
the problem domain. The end result after constructing and training this system is the
same as a traditional neural network: a collection of layers of nodes with weighted
connections. The design and modeling process can be thought of as compiling an invariant
classifier into a neural network. We contend that these invariance signatures, whilst
not unique, are sufficient to characterise contours for many pattern recognition tasks.

1  Introduction 

1.1  The  Model-Based  Approach  to  Building  Neural  Networks 

The  MBNN  approach  aims  to  retain  the  advantages  of  Traditional  Neural  Networks 
(TNNs),  i.e.  parallel  data-processing,  but  to  constrain  the  process  by  which  the 
architecture  of  the  network  and  the  values  of  the  weights  are  determined,  so  that  the 
designer  can  use  expert  knowledge  of  the  problem  domain.  MBNNs  were  introduced 
in  [1].  In  that  paper  we  proposed  networks  in  which  the  weights  were  functions  of  the 
relative  positions  of  nodes,  and  several,  possibly  shared,  parameters.  This  reduced 
the  dimensionality  of  the  space  searched  during  training,  and  the  size  of  the  training 
set  required,  since  the  network  was  guaranteed  to  respond  only  to  certain  features. 
The  resultant  network  was  just  a  collection  of  nodes  and  weighted  connections, 
exactly  as  in  a  TNN.  The  key  notion  was  that  neural  networks  with  highly  desirable 
properties  could  be  produced  by  using  expert  knowledge  to  constrain  the  weight 
determination  process.  The  MBNN  approach  departs  from  the  TNN  view  in  that 
the  operation  is  not  determined  entirely  by  the  training  set  supplied. 

1.2  Model-Based  Neural  Networks  and  Invariant  Pattern 
Recognition 

That  the  parameters  of  TNNs  are  entirely  learnt  can  be  a  limitation.  To  achieve 
good  performance,  the  training  set  must  be  sufficiently  large  and  varied  to  span 
the  input  space.  Collecting  this  data  and  training  the  network  can  be  very  time- 
consuming.  The  MBNN  formulation  allows  the  creation  of  networks  guaranteed  to 
respond  to  particular  features  in,  and  to  be  invariant  under  certain  transformations 
of, the input image. A data set containing various shifted, distorted or otherwise
transformed versions of the input patterns, such as has long been a common approach
to invariant pattern recognition using neural networks [2], is not required. The
concept of MBNNs is here extended to include modular networks. Each module
has a well-defined functionality. The weights in each module can be arrived at by
any technique at all: some may be set by the designer, others by training the module
for a specific task. This approach allows the network designer great flexibility.
Separately  trained  modules  can  perform  data  processing  tasks  independent  of  the 
final  classification  of  the  input  pattern. 

2  Invariance  Signatures 

2.1  Functions  Invariant  With  Respect  To  Lie  Transformation 
Groups 

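We wish to determine the invariance of a function $F(x, y)$ with respect to a Lie
transformation group.

$$\mathbf{G}(x, y) = [\, g_x(x, y) \;\; g_y(x, y) \,]^T \qquad (1)$$

is the vector field corresponding to the generator of the group. $F$ is constant with
respect to the action of the generator if

$$\nabla F \cdot \mathbf{G}(x, y) = 0, \quad \text{i.e.} \quad \frac{\partial F}{\partial x} g_x + \frac{\partial F}{\partial y} g_y = 0. \qquad (2)$$

Consider a contour parameterised by $t$ on which $F$ is constant:

$$F(x(t), y(t)) = C \quad \forall\, t. \qquad (3)$$

Since $F$ is constant on the contour, we have:

$$\frac{dF}{dt} = \frac{\partial F}{\partial x}\frac{dx}{dt} + \frac{\partial F}{\partial y}\frac{dy}{dt} = 0, \qquad (4)$$

so that

$$\frac{dy}{dx} = -\frac{\partial F / \partial x}{\partial F / \partial y}.$$

Thus the invariance condition given in equation 2 holds if

$$\frac{dy}{dx} = \frac{g_y}{g_x}. \qquad (5)$$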
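
As a quick numerical illustration of this condition, the sketch below tests whether a contour tangent is parallel to a generator field. It uses the cross-multiplied form of the condition (tangent and field parallel) to avoid dividing by a vanishing component; the rotation generator $(-y, x)$ and the function names are assumptions made for the example.

```python
import math

def rotation_generator(x, y):
    """Vector field of the rotation group generator (unnormalised)."""
    return (-y, x)

def invariance_residual(tangent, field):
    """dy/dt * g_x - dx/dt * g_y, which is zero when dy/dx = g_y / g_x."""
    (dx, dy), (gx, gy) = tangent, field
    return dy * gx - dx * gy

# A circle about the origin is invariant under rotation: the residual vanishes.
t = 0.7
circle_residual = invariance_residual(
    (-math.sin(t), math.cos(t)),                    # tangent of (cos t, sin t)
    rotation_generator(math.cos(t), math.sin(t)),
)

# A horizontal line through (1, 0.5) is not: the residual is non-zero.
line_residual = invariance_residual((1.0, 0.0), rotation_generator(1.0, 0.5))
```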


2.2  Mapping  Points  from  Image  Space  to  Invariance  Space 

The tangent to the contour at each point is compared with the vector fields
characterising given transformations. Both the tangent vector
$\boldsymbol{\theta}(x, y)$ and the vector fields $\mathbf{v}_c(x, y)$ are normalised
everywhere. The measure of consistency with invariance class $c$ at point $(x, y)$ is:

$$i_c(x, y) = |\boldsymbol{\theta}(x, y) \cdot \mathbf{v}_c(x, y)|. \qquad (6)$$

This operation maps points from the image to the interior of a unit hypercube in
an $n$-dimensional invariance space:

$$(x, y) \mapsto [\, i_1(x, y) \;\; i_2(x, y) \;\; \cdots \;\; i_n(x, y) \,]^T. \qquad (7)$$

The origin of the image space is chosen to be the centroid of the contour, so that
the invariance signature is shift invariant. The vector field for rotation invariance
is

$$\mathbf{v}_{rot}(x, y) = \frac{1}{\sqrt{x^2 + y^2}} [\, -y \;\; x \,]^T. \qquad (8)$$

The vector field for dilation invariance is similar. For the translation invariance
case, the vector field $\mathbf{v}_{trans}(x, y)$ is constant for all $(x, y)$. The
direction is that of the dominant eigenvector of the coordinate covariance matrix of the
contour. The invariance signature of an image consists of histograms of the invariance
measures for all the “on” points. It is invariant under rotations, dilations,
translations and reflections of the image. This encoding is not unique, unlike some
previous integral transform representations [3].
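
The rotation part of the signature can be sketched as follows. The example assumes that contour points are given relative to the centroid together with a tangent estimate at each point, and that the histogram bins are centred as described for the binning unit later in the paper (first bin on 0, last on 1). The function names are hypothetical.

```python
import math

def unit(v):
    """Normalise a 2-D vector; the zero vector is returned unchanged."""
    n = math.hypot(v[0], v[1])
    return (v[0] / n, v[1] / n) if n > 0 else (0.0, 0.0)

def rotation_measure(point, tangent):
    """|theta . v_rot| at a point given relative to the contour centroid."""
    x, y = point
    v = unit((-y, x))
    t = unit(tangent)
    return abs(t[0] * v[0] + t[1] * v[1])

def histogram(values, n_bins):
    """n_bins bins on [0, 1], with the first bin centred on 0 and the last on 1."""
    counts = [0] * n_bins
    for v in values:
        counts[int(round(v * (n_bins - 1)))] += 1
    return counts
```

For a circle centred on the centroid every point has measure 1, so all the mass falls in the last bin; a radial line gives measure 0 everywhere.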
3  A  Neural  Network  System  For  Computing  Invariance
Signatures 

A MBNN was constructed to compute invariance signatures and classify input patterns
on this basis. This MBNN, consisting of a system of modules, some hand-coded and some
trained, is shown in Figure 1. Whilst this system appears complex, it retains the basic
characteristics of a TNN¹. Each node $i$ computes the sum of its weighted inputs,
$net_i = \sum_j w_{ij} x_j$. This is used as the input to the transfer function $f$,
which is either linear, $f(net_i) = net_i$, or the standard sigmoid,
$f(net_i) = 1/(1 + e^{-net_i})$. The only departure from a TNN is that some of the
weights are dynamic: the weight is calculated by a node higher up in the network. This
allows the MBNN to compute dot products², and some nodes to act as gates. Since
weights in any neural network implementation are just references to a stored value,
this should not present any difficulty. There is insufficient space here to describe
all the modules in the system. Descriptions of the Local Orientation Extractor and
the Binning Unit are given as examples of the way the modules were constructed.

Figure 1 Invariance Signature-Based Contour Recognition System.

¹ With the exception of the Dominant Image Orientation Unit, for which a neural network
solution is still being developed.

² The calculation of dot products is achieved by using the outputs of one layer as the
weights on connections to another layer.
3.1  Local  Orientation  Extraction

A simple and robust estimate of the tangent vector at a point is the dominant
eigenvector of the covariance matrix of a square window centred on that point.
The vector magnitudes are weighted by the strength of the orientation. Let the
eigenvalues be $\lambda_1$ and $\lambda_2$, $\lambda_1 \geq \lambda_2$. The
corresponding unit eigenvectors are $\mathbf{e}_1$ and $\mathbf{e}_2$. The weighted
tangent vector estimate $\mathbf{s}$ is:

$$\mathbf{s} = \begin{cases} \ldots & \lambda_1 \neq 0, \\ \mathbf{0} & \lambda_1 = 0. \end{cases} \qquad (9)$$

A TNN with a 3 x 3 input layer, 20 hidden nodes, and 2 output nodes was trained
using backpropagation [4] to produce this output for all possible binary input images
with a centre value of 1. Training was stopped after 6000000 iterations, when
96.89% of the training set variance was fitted. This problem is similar to edge
extraction, except that edge extraction is usually performed on greyscale gradients
rather than thin binary contours. Srinivasan et al. [5] have developed a neural network
edge detector which produces a weighted vector output like that in equation
9. We intend to produce a more compact and accurate tangent estimator using a
MBNN incorporating Gabor weighting functions, as used in [1].
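
The target output that the Local Orientation Extractor was trained to reproduce can be computed directly, as sketched below for a binary window. The sketch assumes that the covariance is taken over the coordinates of the "on" pixels, and it assumes the weighting $1 - \lambda_2/\lambda_1$ as the "strength of the orientation"; that weighting is not confirmed by the text here and is only one plausible choice. The function names are hypothetical.

```python
import math

def window_covariance(window):
    """Covariance (sxx, sxy, syy) of the coordinates of the 'on' pixels."""
    pts = [(c, r) for r, row in enumerate(window) for c, v in enumerate(row) if v]
    n = len(pts)
    if n == 0:
        return 0.0, 0.0, 0.0
    mx = sum(p[0] for p in pts) / n
    my = sum(p[1] for p in pts) / n
    sxx = sum((p[0] - mx) ** 2 for p in pts) / n
    syy = sum((p[1] - my) ** 2 for p in pts) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in pts) / n
    return sxx, sxy, syy

def weighted_tangent(window):
    """Dominant eigenvector of the window covariance, scaled by an assumed strength."""
    sxx, sxy, syy = window_covariance(window)
    half_trace = 0.5 * (sxx + syy)
    root = math.sqrt((0.5 * (sxx - syy)) ** 2 + sxy ** 2)
    lam1, lam2 = half_trace + root, half_trace - root
    if lam1 <= 0.0:
        return (0.0, 0.0)  # isolated point: no orientation
    # Closed-form dominant eigenvector of the symmetric 2x2 matrix.
    if abs(sxy) > 1e-12:
        e1 = (lam1 - syy, sxy)
    elif sxx >= syy:
        e1 = (1.0, 0.0)
    else:
        e1 = (0.0, 1.0)
    norm = math.hypot(e1[0], e1[1])
    strength = 1.0 - lam2 / lam1  # ASSUMED weighting, not taken from the paper
    return (strength * e1[0] / norm, strength * e1[1] / norm)
```

With this choice a straight three-pixel segment gives a unit-length estimate along the segment, and an isolated centre pixel gives the zero vector.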

3.2  The  Binning  Unit 


Figure 2 Neural Binning Subsystem.

The weights for the binning unit in Figure 2 were determined by hand. There is a
binning unit for each of the $n$ bins in each invariance signature histogram. Each
binning unit is connected to all nodes in the invariance image, and inputs to it are
gated by the input image, so that only the $N$ nodes corresponding to ones in the
input image contribute. The bins have width $\frac{1}{n-1}$, since the first bin is
centred on 0, and the last on 1³. Nodes A and B only have an output of 1 when the
input $x$ is within bin $i$. This condition is met when:

$$\frac{2i - 1}{2(n - 1)} < x < \frac{2i + 1}{2(n - 1)}. \qquad (10)$$

To detect this condition, the activations of nodes A and B are set to:

$$net_A = \alpha x - \frac{\alpha(2i - 1)}{2(n - 1)}, \qquad (11)$$

$$net_B = -\alpha x + \frac{\alpha(2i + 1)}{2(n - 1)}, \qquad (12)$$

where $\alpha$ is a large number, causing the sigmoid to approximate a step function.
Here, $\alpha = 1000$ was used. Node C acts as an AND gate. Node D sums the
contributions to bin $i$ from all $N$ nodes.

³ A bin ending at 0 or 1 would miss contributions from the extrema since the edge of
the neural bin is not vertical.
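
The binning unit can be imitated in a few lines. The sketch below follows the activations given for nodes A and B; the weights of the AND gate (node C) are not stated in the text, so the values used here are an assumption, as are the function names.

```python
import math

ALPHA = 1000.0  # steepness quoted in the text

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def in_bin(x, i, n, alpha=ALPHA):
    """Soft indicator that x lies in bin i of n bins centred on i / (n - 1)."""
    net_a = alpha * x - alpha * (2 * i - 1) / (2 * (n - 1))    # node A
    net_b = -alpha * x + alpha * (2 * i + 1) / (2 * (n - 1))   # node B
    a, b = sigmoid(net_a), sigmoid(net_b)
    # Node C as an AND gate: fires only if both A and B are on.
    # The weights (alpha, alpha) and bias -1.5 * alpha are ASSUMED.
    return sigmoid(alpha * (a + b - 1.5))

def bin_count(values, i, n):
    """Node D: sum of the gated contributions to bin i."""
    return sum(in_bin(x, i, n) for x in values)
```

With $n = 5$ the bins are centred on 0, 0.25, 0.5, 0.75 and 1, so the values 0.0 and 0.1 both land in bin 0 and the value 1.0 lands in bin 4.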


4  Simulation  Results 

To demonstrate the feasibility of such an invariance signature-based classifier, a
small simulation was done. The task was to classify an input pattern as one of the
first ten letters of the alphabet. The training set contained four examples of each
letter, each differently rotated or reflected, within an 18 x 18 input image. The
test set contained three differently transformed examples of each letter. The final
classification stage of the MBNN used a 9 node hidden layer and was trained using
backpropagation. A TNN with an 18 x 18 input layer, 9 node hidden layer and 10
node output layer was also trained. After 1000 iterations, both correctly classified
100% of the training data. The MBNN correctly classified 76.66% of the test data,
whereas the TNN did not correctly classify any of it. It is clear that the MBNN
massively outperformed the TNN. Had the Local Orientation Extractor been ideal,
the optimal performance on the training data would have been 80%, since “b” and
“d” are identical under reflection in the font used.


5  Conclusion 

Using the MBNN approach, neural networks can be constructed which are guaranteed
to classify input contours using invariance signatures which are rotation, dilation,
translation and reflection invariant. The resultant MBNN retains the useful
properties of being composed of similar simple nodes joined by weighted connections,
with inherently parallel computation. The MBNN approach is thus a useful
technique for compiling a neural network, incorporating the designer’s expert
knowledge.
This paper presents constructive approximations by three-layer artificial neural
networks with (1) trigonometric, (2) piecewise linear, and (3) sigmoidal hidden-layer
units. These approximations provide (a) approximating-network equations, (b)
specifications for the numbers of hidden-layer units, (c) approximation error
estimations, and (d) saturations of the approximations.
1  Introduction

Previous studies on function approximation by artificial neural networks show only
the existence of approximating networks by non-constructive methods [1][2] and thus
contribute almost nothing to developing the networks and to specifying the
properties of the approximations. This paper presents constructive approximations by
networks with (1) trigonometric, (2) piecewise linear, and (3) sigmoidal hidden-layer
units. These approximations provide (a) approximating-network equations,
(b) specifications for the numbers of hidden-layer units, (c) approximation error
estimations, and (d) saturations of the approximations.


2  Preliminaries 

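Let $\mathbb{N}$ and $\mathbb{R}$ be the sets of natural and real numbers, and
$\mathbb{N}_0 = \mathbb{N} \cup \{0\}$. Let $|\mathbf{r}| = \sum_{i=1}^{m} |r_i|$ for
$\mathbf{r} = (r_i)_{i=1}^{m} \in \mathbb{N}_0^m$ and
$\|\mathbf{t}\| = (\sum_{i=1}^{m} t_i^2)^{1/2}$ for
$\mathbf{t} = (t_i)_{i=1}^{m} \in \mathbb{R}^m$. For $p \ge 1$, we denote by
$L^p_{2\pi}(\mathbb{R}^m)$ the space of $p$th-order Lebesgue-integrable functions
defined on $\mathbb{R}^m$ to $\mathbb{R}$ that are $2\pi$-periodic in each coordinate,
with $L^p_{2\pi}$-norm

$$\|f\|_{L^p_{2\pi}} = \left\{ (2\pi)^{-m} \int_{-\pi}^{\pi} \cdots \int_{-\pi}^{\pi} |f(\mathbf{x})|^p \, d\mathbf{x} \right\}^{1/p},$$

and by $C_{2\pi}(\mathbb{R}^m)$ the space of $2\pi$-periodic continuous functions
defined on $\mathbb{R}^m$ to $\mathbb{R}$ with $C_{2\pi}$-norm
$\|f\|_{C_{2\pi}} = \sup\{|f(\mathbf{x})| ;\ |x_i| \le \pi,\ i = 1, \ldots, m\}$. Let
$\Psi = L^p_{2\pi}(\mathbb{R}^m)$ or $C_{2\pi}(\mathbb{R}^m)$ throughout this paper.
For $f, g \in \Psi$, we define the inner product
$(f(\mathbf{t}), g(\mathbf{t})) = (2\pi)^{-m} \int_{-\pi}^{\pi} \cdots \int_{-\pi}^{\pi} f(\mathbf{t}) g(\mathbf{t}) \, d\mathbf{t}$
and the convolution
$f * g(\mathbf{x}) = (2\pi)^{-m} \int_{-\pi}^{\pi} \cdots \int_{-\pi}^{\pi} f(\mathbf{t}) g(\mathbf{x} - \mathbf{t}) \, d\mathbf{t}$.
Let the sigmoidal function $\mathrm{sig}(x) = \{1 + \exp(-x)\}^{-1}$. Let $f \in \Psi$
and $\delta \ge 0$. We introduce the modulus of continuity of $f$ in $\Psi$,

$$\omega_{\Psi}(f, \delta) = \sup\{\|f(\cdot + \mathbf{t}) - f(\cdot)\|_{\Psi} ;\ \mathbf{t} \in \mathbb{R}^m,\ \|\mathbf{t}\| \le \delta\},$$

where
$\|f(\cdot + \mathbf{t}) - f(\cdot)\|_{L^p_{2\pi}} = \{(2\pi)^{-m} \int_{-\pi}^{\pi} \cdots \int_{-\pi}^{\pi} |f(\mathbf{x} + \mathbf{t}) - f(\mathbf{x})|^p \, d\mathbf{x}\}^{1/p}$
and
$\|f(\cdot + \mathbf{t}) - f(\cdot)\|_{C_{2\pi}} = \sup\{|f(\mathbf{x} + \mathbf{t}) - f(\mathbf{x})| ;\ |x_i| \le \pi,\ i = 1, \ldots, m\}$.

We say $f$ satisfies a Lipschitz condition with constant $M > 0$ and exponent $\nu > 0$
in $\Psi$, if

$$\|f(\cdot + \mathbf{t}) - f(\cdot)\|_{\Psi} \le M \|\mathbf{t}\|^{\nu}$$

for $\mathbf{t} \in \mathbb{R}^m$. Let $\mathrm{Lip}_M(\Psi; \nu)$ be the set of
functions satisfying this condition and the Lipschitz class of order $\nu$ be
$\mathrm{Lip}\,\nu = \bigcup \{\mathrm{Lip}_M(\Psi; \nu) ;\ M > 0\}$. We notice that if
$f \in \Psi$ satisfies a Lipschitz condition with constant $M$ and exponent $\nu$ in
$\Psi$, then $\omega_{\Psi}(f, \delta) \le M\delta^{\nu}$.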
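
The modulus of continuity and its Lipschitz bound can be illustrated numerically in the simplest setting. The sketch below treats only the one-dimensional case with the sup norm, and it replaces both suprema by maxima over finite grids, so it returns a lower estimate of the true value. The function name and grid sizes are assumptions.

```python
import math

def modulus_of_continuity(f, delta, n_x=400, n_t=40):
    """Grid estimate of sup over |t| <= delta of sup_x |f(x + t) - f(x)|, for m = 1."""
    xs = [-math.pi + 2.0 * math.pi * j / n_x for j in range(n_x)]
    ts = [delta * j / n_t for j in range(-n_t, n_t + 1)]
    return max(abs(f(x + t) - f(x)) for t in ts for x in xs)

# sin satisfies a Lipschitz condition with M = 1 and exponent 1,
# so the estimate should not exceed delta.
estimate = modulus_of_continuity(math.sin, 0.1)
```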
3  Results

3.1  Constructive  Approximations  and  approximation  error 
estimations 

tn  f  \  Yj% 

Let  =  n  for  r  =  {n)T=i  ^  and  =  ( A+2)  for  A  €  IN  in  this 

section. 

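Theorem 1 (trigonometric case) Let $\mathbf{f} = (f_i)_{i=1}^{n}$ be a function defined
on $\mathbb{R}^m$ to $\mathbb{R}^n$ such that $f_i \in \Psi$. For arbitrary parameter
$\lambda \in \mathbb{N}$, a three-layer network with trigonometric hidden-layer units
$TN[\mathbf{f}]^{\lambda} = (TN[f_i]^{\lambda})_{i=1}^{n}$ approximates to $\mathbf{f}$
with $\Psi$-norm, such that

$$TN[f_i]^{\lambda}(\mathbf{x}) = \theta[f_i]^{\lambda} + \sum_{(\mathbf{p},\mathbf{q})}^{0 \le p_i, q_i \le \lambda} \left\{ \alpha[f_i]^{\lambda}(\mathbf{p},\mathbf{q}) \cos(\mathbf{p}-\mathbf{q})\mathbf{x} + \beta[f_i]^{\lambda}(\mathbf{p},\mathbf{q}) \sin(\mathbf{p}-\mathbf{q})\mathbf{x} \right\}, \qquad (1)$$

where $\theta[f_i]^{\lambda} = (f_i(\mathbf{t}), 1)$,
$\alpha[f_i]^{\lambda}(\mathbf{p},\mathbf{q}) = 2B_{\lambda}^{-1} b_{\mathbf{p}}^{\lambda} b_{\mathbf{q}}^{\lambda} (f_i(\mathbf{t}), \cos(\mathbf{p}-\mathbf{q})\mathbf{t})$,
and
$\beta[f_i]^{\lambda}(\mathbf{p},\mathbf{q}) = 2B_{\lambda}^{-1} b_{\mathbf{p}}^{\lambda} b_{\mathbf{q}}^{\lambda} (f_i(\mathbf{t}), \sin(\mathbf{p}-\mathbf{q})\mathbf{t})$.
(The above summation is over combinations of $\mathbf{p} = (p_i)_{i=1}^{m}$,
$\mathbf{q} = (q_i)_{i=1}^{m} \in \mathbb{N}_0^m$ such that $\mathbf{p} \ne \mathbf{q}$,
$0 \le p_i, q_i \le \lambda$, i.e., if the summation is added in the case of
$(\mathbf{p}, \mathbf{q})$, it is not added in the case of $(\mathbf{q}, \mathbf{p})$.
This notation of the summation is used throughout this paper.) Then
$TN[\mathbf{f}]^{\lambda}$ has $(2\lambda + 1)^m - 1$ hidden-layer units. Also the
approximation error of each coordinate with $\Psi$-norm for $i = 1, \ldots, n$ is
estimated by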

||/i-TJV[/.f||^<  (l  +  yV^)a;®(/i,  (A  +  2)-').  (2) 


The next corollary means that $TN[\mathbf{f}]^{\lambda}$ can approximate $\mathbf{f}$
with any degree of accuracy if $\lambda$ increases, i.e., the number of hidden-layer
units increases.

Corollary 1 $\|f_i - TN[f_i]^{\lambda}\|_{\Psi} \to 0$ as $\lambda \to \infty$
$(i = 1, \ldots, n)$.


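Theorem 2 (piecewise linear case) Let $\mathbf{f} = (f_i)_{i=1}^{n}$ be a function
defined on $\mathbb{R}^m$ to $\mathbb{R}^n$ such that
$f_i \in L^p_{2\pi}(\mathbb{R}^m)$. For two arbitrary independent parameters
$\lambda, \sigma \in \mathbb{N}$, a three-layer network with piecewise linear
hidden-layer units
$PN[\mathbf{f}]^{\lambda,\sigma} = (PN[f_i]^{\lambda,\sigma})_{i=1}^{n}$ approximates to
$\mathbf{f}$ with $L^p_{2\pi}$-norm, such that

$$PN[f_i]^{\lambda,\sigma}(\mathbf{x}) = \theta[f_i]^{\lambda,\sigma} + \sum_{(\mathbf{p},\mathbf{q})}^{0 \le p_i, q_i \le \lambda} \sum_{k=0}^{4|\mathbf{p}-\mathbf{q}|\sigma - 1} \alpha[f_i]^{\lambda,\sigma}(\mathbf{p},\mathbf{q},k)\, PL_k^{\sigma}((\mathbf{p}-\mathbf{q})\mathbf{x}), \qquad (3)$$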


where 

PLtirx) 


0  (rx<-|r|x+|i),  1  {rx>-\r\^+^^) 

^rx  +  2|r|<r-fc  (_  |r| ,  +  <  rx  <  -  |r|  ;r  +  ’ 


0<P». 

^[/if-'’  =  (/.(t),l)  +  2S^sin^  X  (-l)"’'’'6p«'q(/i(t),cos(p-q)t), 

(p.q) 
(/■  (t) ,  sin  (p  -  q)  t)  cos


(2k  +  1)  TT 


cos(p-q)t)sin 


.  (2fc  +  l) 


4(7  '-x  /  '  -  -  '  4a- 

Then $PN[\mathbf{f}]^{\lambda,\sigma}$ has
$2m\sigma\lambda(\lambda + 1)(2\lambda + 1)^{m-1}$ hidden-layer units. Also the
approximation error of each coordinate with $L^p_{2\pi}$-norm is estimated by


“•c)} 


1/p 


■  (4) 


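Theorem 3 (sigmoidal case) Let $\mathbf{f} = (f_i)_{i=1}^{n}$ be a function defined on
$\mathbb{R}^m$ to $\mathbb{R}^n$ such that $f_i \in L^p_{2\pi}(\mathbb{R}^m)$. For two
arbitrary independent parameters $\lambda, \sigma \in \mathbb{N}$, a three-layer network
with sigmoidal hidden-layer units
$SN[\mathbf{f}]^{\lambda,\sigma} = (SN[f_i]^{\lambda,\sigma})_{i=1}^{n}$ approximates to
$\mathbf{f}$ with $L^p_{2\pi}$-norm, such that

$$SN[f_i]^{\lambda,\sigma}(\mathbf{x}) = \theta[f_i]^{\lambda,\sigma} + \sum_{(\mathbf{p},\mathbf{q})}^{0 \le p_i, q_i \le \lambda} \sum_{k=0}^{4|\mathbf{p}-\mathbf{q}|\sigma - 1} \alpha[f_i]^{\lambda,\sigma}(\mathbf{p},\mathbf{q},k)\, SG_k^{\sigma}((\mathbf{p}-\mathbf{q})\mathbf{x}), \qquad (5)$$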

where  SGI  (rx)  =  sig  (^rx  +  8  |r|  cr  -  4^  -  2) ,  and  0  ^  and  a  ^  (p,  q,  k) 


are the same as in Theorem 2. Then $SN[\mathbf{f}]^{\lambda,\sigma}$ has
$2m\sigma\lambda(\lambda + 1)(2\lambda + 1)^{m-1}$ hidden-layer units. Also the
approximation error of each coordinate with $L^p_{2\pi}$-norm is estimated by


.m—l 


<  1  + 


TTy/m 


+ 


(2,rr-^<T  2  +  t  “*1 


Remark 1 $PN[\mathbf{f}]^{\lambda,\sigma}$ and $SN[\mathbf{f}]^{\lambda,\sigma}$ in
Theorems 2 and 3 are based on $TN[\mathbf{f}]^{\lambda}$ in Theorem 1. In fact, for any
$\lambda \in \mathbb{N}$, $PN[\mathbf{f}]^{\lambda,\sigma}$ and
$SN[\mathbf{f}]^{\lambda,\sigma}$ approach $TN[\mathbf{f}]^{\lambda}$ as $\sigma$
increases. Therefore they are almost determined by $\lambda$, if $\sigma$ is large
enough for $\lambda$.

For any $\lambda \in \mathbb{N}$, the second terms of the right-hand members of
Inequalities (4) and (6) approach 0 as $\sigma$ increases. Then,
$\|f_i - PN[f_i]^{\lambda,\sigma}\|_{L^p_{2\pi}}$ and
$\|f_i - SN[f_i]^{\lambda,\sigma}\|_{L^p_{2\pi}}$ are almost estimated by the first
terms, which are the same as Inequality (2), when $\sigma$ is large enough for
$\lambda$. The following two corollaries give conditions on $\sigma$ in terms of
$\lambda$ under which the approximation errors approach 0 as $\lambda$ increases. Under
these conditions $PN[\mathbf{f}]^{\lambda,\sigma}$ and
$SN[\mathbf{f}]^{\lambda,\sigma}$ can approximate $\mathbf{f}$ with any degree of
accuracy, if the number of hidden-layer units increases.

Corollary  2  If  or  is  a  higher-order  infinity  than  X~^ ,  i.e.,  a  —  cr(A)  oo  and 


mp 


0  as  A  *->•  oo,  then 


fi-PN[fi] 


A,  <7 


0 


oo  (i  =  1 , . . . ,  n) . 
Corollary  3  If  a  is  a  higher-order  infinity  than  ,  i.e.,  (t  =  a  [X)  ^  oo  and


(7(A) 


0  a5  A  — +  oo,  then  \\fi  —  SN  [fi]  ’ 


A,  <7 


0  as  X  ^  oo  (i  =  1, . . . ,  n) . 


3.2  Saturation  of  Constructive  Approximations 

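We denote $\phi_{\lambda} = o(\zeta_{\lambda})$ $(\lambda \to \infty)$, if
$\phi_{\lambda}/\zeta_{\lambda} \to 0$ as $\lambda \to \infty$, and
$\phi_{\lambda} = O(\zeta_{\lambda})$ $(\lambda \to \infty)$, if there exists a constant
$M > 0$ such that $|\phi_{\lambda}/\zeta_{\lambda}| \le M$ for large enough $\lambda$.

Definition 1 (Saturation of an approximation method) A sequence of operators
$U = \{U_{\lambda}\}_{\lambda \in \mathbb{N}}$ in $\Psi$ is called an approximation
method in $\Psi$, if

$$\lim_{\lambda \to \infty} \|U_{\lambda}(f) - f\|_{\Psi} = 0$$

for $f \in \Psi$. Let $\{\phi_{\lambda}\}_{\lambda \in \mathbb{N}}$ be a sequence of
real numbers such that $\phi_{\lambda} \to 0$ as $\lambda \to \infty$. An approximation
method $U$ in $\Psi$ is saturated with order $\phi_{\lambda}$, if (S1) if $f \in \Psi$
and $\|U_{\lambda}(f) - f\|_{\Psi} = o(\phi_{\lambda})$ $(\lambda \to \infty)$, then $f$
is an identical element of $U$, i.e., $U_{\lambda}(f) = f$ for
$\lambda \in \mathbb{N}$, and (S2) the saturation class of $U$,
$S[U] = \{f \in \Psi ;\ \|U_{\lambda}(f) - f\|_{\Psi} = O(\phi_{\lambda})\ (\lambda \to \infty)\}$,
contains at least one element which is not an identical element of $U$.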


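Theorem 4 (trigonometric case) Approximation by
$\mathcal{TN} = \{TN[\,\cdot\,]^{\lambda}\}_{\lambda \in \mathbb{N}}$ in Theorem 1 is an
approximation method saturated with order $\lambda^{-2}$ in $\Psi$ and its identical
elements are just constant functions almost everywhere on $\mathbb{R}^m$. When
$\Psi = C_{2\pi}(\mathbb{R}^m)$, the saturation class $S[\mathcal{TN}]$ is
characterized as

$$S[\mathcal{TN}] \supset \left\{ f \in \Psi ;\ \frac{\partial f}{\partial x_i} \in \mathrm{Lip}\,1,\ i = 1, \ldots, m \right\}$$

and, if $m = 1$, $S[\mathcal{TN}] = \{ f \in \Psi ;\ f' \in \mathrm{Lip}\,1 \}$.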

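Remark 2 (piecewise linear and sigmoidal cases) Approximations by
$PN[\mathbf{f}]^{\lambda,\sigma}$ and $SN[\mathbf{f}]^{\lambda,\sigma}$ in Theorems 2
and 3 are also saturated with order $\lambda^{-2}$ in $L^p_{2\pi}(\mathbb{R}^m)$, if
$\sigma$ is large enough for $\lambda$. This is because, for any
$\lambda \in \mathbb{N}$, $PN[\mathbf{f}]^{\lambda,\sigma}$ and
$SN[\mathbf{f}]^{\lambda,\sigma}$ approach $TN[\mathbf{f}]^{\lambda}$ as $\sigma$
increases, as stated in Remark 1.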

3.3  Outline  of  the  Proof 

Multidimensional extension of approximation by the convolution operator and the
multidimensional Fejér-Korovkin kernel, which are introduced in this study, provide
a construction of an approximating network with trigonometric hidden-layer units,
the approximation error estimation, and saturation of the approximation. Then,
from the network and constructive approximation to multidimensional trigonometric
functions by networks with piecewise linear and sigmoidal hidden-layer units,
constructions of approximating networks with piecewise linear and sigmoidal
hidden-layer units, the approximation error estimations, and saturation of the
approximations are derived.

4  Example 

This is an example of an approximation to the function $\exp(-(x^2 + y^2))$.
Approximating networks are calculated at $\lambda = 2, 4, 6, 8, 10$ and $\sigma = 60$.
The numbers of hidden-layer units are respectively 24, 80, 168, 288, 440 in the
trigonometric case and $7.2 \times 10^3$, $4.3 \times 10^4$, $1.3 \times 10^5$,
$2.9 \times 10^5$, $5.5 \times 10^5$ in the piecewise linear and sigmoidal cases. The
actual errors with $L^p_{2\pi}$-norm are numerically calculated from the left-hand
members of Inequalities (2), (4), and (6), and the estimated errors are calculated from
their right-hand members.

Figure 1 The approximated function.

Figure 2 Approximating networks with three kinds of hidden-layer units at
$\sigma = 60$. (When $\lambda$ is the same, the three kinds of networks have almost the
same figures.)
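
The hidden-layer unit counts quoted in this example can be reproduced from the formulas in Theorems 1 to 3, taking $m = 2$ for the two-variable target function. The sketch below is only a check of that arithmetic; the function names are hypothetical.

```python
def trig_hidden_units(lam, m):
    """Number of trigonometric hidden-layer units in Theorem 1."""
    return (2 * lam + 1) ** m - 1

def pl_or_sig_hidden_units(lam, sigma, m):
    """Number of piecewise linear or sigmoidal hidden-layer units (Theorems 2 and 3)."""
    return 2 * m * sigma * lam * (lam + 1) * (2 * lam + 1) ** (m - 1)

LAMBDAS = (2, 4, 6, 8, 10)
trig_counts = [trig_hidden_units(lam, 2) for lam in LAMBDAS]
pl_counts = [pl_or_sig_hidden_units(lam, 60, 2) for lam in LAMBDAS]
```

Here `trig_counts` is `[24, 80, 168, 288, 440]` and `pl_counts` is `[7200, 43200, 131040, 293760, 554400]`, which round to the figures given in the text. The gap of roughly three orders of magnitude comes from the factor $\sigma$ and the extra powers of $\lambda$.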

5  Conclusion 

This paper presents constructive approximations by three-layer artificial neural
networks with (1) trigonometric, (2) piecewise linear, and (3) sigmoidal hidden-layer
units to $2\pi$-periodic $p$th-order Lebesgue integrable functions defined on
$\mathbb{R}^m$ to $\mathbb{R}^n$ for $p \ge 1$ with $L^p_{2\pi}$-norm. (In the case of
(1), networks with trigonometric hidden-layer units can also approximate
$2\pi$-periodic continuous functions defined on $\mathbb{R}^m$ to $\mathbb{R}^n$ with
$C_{2\pi}$-norm at the same time.) The approximations provide (a)
approximating-network equations, (b) specifications for the numbers of hidden-layer
units, (c) approximation error estimations, and (d) saturations of the approximations.
These results can easily be applied to non-periodic functions defined on a
bounded subset of $\mathbb{R}^m$.
In this article, two neural network clustering techniques are compared to classical
statistical techniques. This is achieved by examining the results obtained when applying
each technique to a real-world phoneme recognition task. An analysis of the phoneme
datasets exposes the clusters which exist in the pattern space. The study of the
similarity of the patterns which are clustered together allows an objective evaluation
of the clustering efficiency of these techniques. It also gives rise to a revealing
comparison of the way each technique clusters the dataset.

1  Introduction 

Clustering algorithms attempt to organise unclassified patterns into groups in such
a way that patterns within each group are more similar than patterns belonging to
different groups. In classical statistics, there exists a wide range of agglomerative
and divisive clustering techniques [1], which use as distance measures either
Euclidean-type distances or other metrics suitable for binary data. Recently,
techniques based on neural networks have been developed and have been found well-suited
to clustering large, high-dimensional pattern spaces. In this article, the clustering
potential of two fundamentally different neural network models is studied and compared
to that of statistical techniques. The network models are: (a) the Self-Organising Logic
Neural Network (SOLNN) [2], which is based on the n-tuple sampling method, and
(b) the Harmony Theory Network (HTN) [3], which constitutes a derivative of the
Hopfield network and a variant of the Boltzmann machine.

In an effort to evaluate the effectiveness of the two network models when discovering
the clusters which exist in the pattern space, a comparison is made to a number of
established statistical methods. This comparison is performed in a series of
experiments which use real-world phoneme data. A number of phonemes are selected and
used as prototypes. Various levels of noise are injected into the prototypes, resulting
in different datasets, each consisting of well-defined phoneme classes. The varying
levels of noise cause each dataset to have fundamentally different characteristics.
The behaviour of the clustering techniques is then studied for each dataset.

2  Overview  of  the  Clustering  Techniques 

The Self-Organising Logic Neural Network (SOLNN) shares the main structure of
the discriminator network [4]. It is consequently based on the decomposition of the
input pattern into tuples of n pixels and the comparison of these tuples to the
corresponding tuples of training patterns. In the SOLNN model, the basic discriminator
structure has been extended by allocating m bits to each tuple combination rather
than a single one. This allows the network to store information concerning the
frequency of occurrence of the corresponding tuple combination during learning and
is instrumental to the SOLNN’s ability to learn in the absence of external
supervision. The SOLNN has been shown to successfully perform clustering tasks in an
unsupervised manner [2,5]. The SOLNN model is characterised by the distribution
constraint mechanism [5], which enables the user to determine the desired radius
of the SOLNN clusters. This mechanism is similar to the vigilance parameter of
Adaptive Resonance Theory (ART) [6].

The Harmony Theory Network (HTN) [3] consists of binary nodes arranged in
exactly two layers. For this task, the lower layer encodes the features of the
unclassified patterns and the upper layer encodes the candidate patterns of the
clustering task. Each connection between a feature and a classified pattern specifies
the positive or negative relation between them, i.e. whether or not the pattern
contains the feature. No training is required to adapt the weights, which depend on
the local connectivity of the HTN [3]. During clustering, each candidate pattern is
input to the lower layer of the HTN; the activated nodes of the upper layer constitute
the patterns to which the candidate pattern is clustered (also see [7] for a more
detailed description). Both the required degree of similarity between clustered
patterns and the desired number of clusters are monitored by the parameter k of Harmony
Theory, which resembles the vigilance parameter of ART [6] and the radius of RBF
(Radial-Basis Function) networks [8]. Its value is changed, in a uniform manner for
all candidate patterns, in the search for optimal clustering results.

Hierarchical  statistical  clustering  techniques  [1]  of  the  agglomerative  type  are  used 
for  clustering  in  this  article.  The  techniques  employed  are  (i)  the  single  linkage,  (ii) 
the  complete  linkage,  (iii)  the  median  cluster,  (iv)  the  centroid  cluster  and  (v)  the 
average  cluster. 
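
As an illustration of the agglomerative idea only, the following minimal sketch merges the two closest clusters repeatedly until a requested number of clusters remains. It covers just the single and complete linkage rules, uses the Hamming distance on binary patterns, and runs on made-up 8-bit toy data; the function names and the data are assumptions for illustration and are not taken from the experiments in this article.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length binary patterns differ."""
    return sum(x != y for x, y in zip(a, b))


def agglomerate(patterns, n_clusters, linkage="single"):
    """Merge the two closest clusters until n_clusters remain.

    linkage="single" measures two clusters by their closest pair of members,
    linkage="complete" by their farthest pair.
    """
    pick = min if linkage == "single" else max
    clusters = [[i] for i in range(len(patterns))]  # clusters hold pattern indices

    def cluster_distance(c1, c2):
        return pick(hamming(patterns[i], patterns[j]) for i in c1 for j in c2)

    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: cluster_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return [sorted(c) for c in clusters]


# Toy data: two groups of 8-bit patterns, each a prototype plus one-bit variants.
toy = [
    (0, 0, 0, 0, 0, 0, 0, 0),
    (1, 0, 0, 0, 0, 0, 0, 0),
    (0, 1, 0, 0, 0, 0, 0, 0),
    (1, 1, 1, 1, 1, 1, 1, 1),
    (1, 1, 1, 1, 1, 1, 1, 0),
    (1, 1, 1, 1, 1, 1, 0, 1),
]
single = agglomerate(toy, 2, linkage="single")
complete = agglomerate(toy, 2, linkage="complete")
print(single, complete)
```

On well-separated data such as this, both linkage rules recover the same two groups; the rules only start to disagree once the groups approach each other.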

3  Description  of  the  Clustering  Experiments 

The data employed in the experiments are real-world phonemes which have been
obtained from the dataset of the LVQ-PAK simulation package [9]. The phonemes
in this package have been pre-processed so that each phoneme consists of 20
real-valued features. The selected phonemes (called prototypes) correspond to the
letters “A”, “O”, “N”, “I”, “M”, “U” and “S”. Since both the SOLNN and the HTN
networks require binary input patterns, the phonemes have been digitised by encoding
each real-valued feature into four bits via the thermometer-coding technique [10].
Consequently, the resulting prototypes are 80-dimensional binary patterns.
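
The digitisation step can be sketched as follows. This is a minimal illustration of thermometer coding, assuming evenly spaced thresholds over a known feature range [lo, hi]; the article does not state which thresholds were used, so the threshold placement, the range and the function names are assumptions.

```python
def thermometer_encode(value, lo, hi, bits=4):
    """Encode one real value as `bits` thermometer bits.

    Bit i is 1 when the value reaches the i-th of `bits` evenly spaced
    thresholds strictly inside [lo, hi], so larger values set more bits.
    """
    step = (hi - lo) / (bits + 1)
    return [1 if value >= lo + (i + 1) * step else 0 for i in range(bits)]


def encode_pattern(features, lo, hi, bits=4):
    """Concatenate the thermometer codes of all features of one pattern."""
    encoded = []
    for value in features:
        encoded.extend(thermometer_encode(value, lo, hi, bits))
    return encoded


# A hypothetical phoneme with 20 real-valued features in [0, 1].
phoneme = [0.5] * 20
binary_pattern = encode_pattern(phoneme, 0.0, 1.0)
print(len(binary_pattern))
```

With 20 features and four bits per feature the encoded pattern has 80 bits, which matches the dimensionality quoted in the text.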

Each of the prototypes has been used to generate a number of noisy patterns by
adding random noise of a certain level, namely 2.5, 5, 7.5 and 10%. A different
dataset (experiment-dataset) is created for each level of noise. Every
experiment-dataset consists of groups of noisy patterns whose centroids coincide with,
or are situated very near to, the prototypes. The different levels of noise cause the
experiment-datasets to occupy varying portions of the input space and the groups
of noisy patterns to overlap to a different extent. The prototype and the noisy
patterns for each level of noise constitute a phoneme class. An analysis of the phoneme
classes in each experiment-dataset indicates that the patterns of each phoneme class
are closer to other patterns of the same class than to patterns of other phoneme
classes. As the noise level increases, each phoneme class occupies a larger portion
of the pattern space and the distance between phonemes from different classes is
reduced, while the probability of an overlap occurring between different classes
increases. The 10%-noise dataset is probably the most interesting one since, in that
case, the minimum distance between two phonemes from different classes (more
specifically classes “I” and “M”) becomes equal to the maximum distance between
patterns within any phoneme class. Due to this fact, it is expected to be the most
difficult experiment-dataset to cluster successfully. Its characteristics are summarised
in Table 1 to allow for a detailed evaluation of the clustering results.

Phoneme   Average distance   Minimum distance   Maximum distance   Minimum distance
class     within class       within class       within class       between classes
A         16.11%             10.00%             20.00%             22.50% (A & O)
O         16.67%             10.00%             20.00%             22.50% (O & A)
N         16.22%              7.50%             20.00%             23.75% (N & O)
I         16.33%             10.00%             20.00%             20.00% (I & M)
M         16.22%             10.00%             20.00%             20.00% (M & I)
U         16.06%             10.00%             20.00%             23.75% (U & O)
S         16.22%             10.00%             20.00%             26.25% (S & I)

Table 1 Characteristics of the pattern space used for 10% noise level. Distances
are calculated as percentages of the total number of pixels.
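
The generation of a noisy phoneme class can be sketched as below. The article does not spell out the noise model, so this sketch assumes that a noise level of x% flips exactly x% of the 80 bits of the prototype at random positions; the prototype itself is a random stand-in rather than a real phoneme, and all names are hypothetical.

```python
import random


def hamming_percent(a, b):
    """Hamming distance as a percentage of the pattern length."""
    return 100.0 * sum(x != y for x, y in zip(a, b)) / len(a)


def add_noise(prototype, level, rng):
    """Return a copy of `prototype` with round(level * length) bits flipped."""
    n_flips = round(level * len(prototype))
    noisy = list(prototype)
    for pos in rng.sample(range(len(prototype)), n_flips):
        noisy[pos] = 1 - noisy[pos]
    return noisy


rng = random.Random(0)  # fixed seed so the sketch is repeatable
prototype = [rng.randint(0, 1) for _ in range(80)]  # stand-in for a real phoneme
noisy_class = [add_noise(prototype, 0.10, rng) for _ in range(10)]

# Pairwise distances between noisy patterns of the same class.
within = [
    hamming_percent(a, b)
    for i, a in enumerate(noisy_class)
    for b in noisy_class[i + 1:]
]
print(min(within), sum(within) / len(within), max(within))
```

Under this assumption every noisy pattern lies 10% away from its prototype, and two noisy patterns of the same class can differ by at most 20%, which agrees with the maximum within-class distance reported in Table 1.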

The task consists of grouping the patterns of each experiment-dataset so that the
phoneme classes are uncovered. The results obtained by each of the three clustering
techniques are evaluated by taking into account the characteristics and topology
of each experiment-dataset, i.e. the pixel-wise similarity between the patterns and
the clusters in which they have been organised by each technique. This enables (i)
a comparison of the way in which each technique operates for various data
distributions and (ii) an evaluation of the effect that the relation between the
maximum distance of patterns of the same phoneme class and the minimum distance of
patterns of different phoneme classes has on clustering.

Additionally, a statistical analysis of the pattern space created by each
experiment-dataset has been performed to investigate how effective the clustering
techniques actually are. This investigation is based on the similarity between classes
in the pattern space. The comparison of the two neural network techniques and the
statistical methods, together with an in-depth analysis of the pattern space, provides
an accurate insight into the capabilities and limitations of each of the techniques
studied.

4  Experimental  Results 

The different clustering techniques are applied to the clustering task described in
the previous section. The results obtained are summarised in Table 2, where the
following information is contained:

(i) the proportion of dataset phonemes that are correctly classified, that is of
patterns assigned to a cluster representing their phoneme class;

(ii) the number of multi-phoneme clusters, that is clusters containing patterns from
more than one phoneme class;

(iii) the maximum number of phoneme classes contained in any multi-phoneme
cluster;

(iv) the number of clusters created by each method;

(v) the maximum number of clusters to which patterns of any phoneme class are
assigned.

As can be easily seen, the value of criterion (i) should ideally be equal to 100%,
the value of criterion (ii) should be equal to 0, the value of criterion (iv) should be
equal to 7, while the values of criteria (iii) and (v) should be equal to 1.

Noise level      Criterion                           SOLNN      HTN    Statistical
2.5%, 5%, 7.5%   Correct classification              100%       100%   100%
                 Multi-phoneme clusters formed       0          0      0
                 Phonemes in multi-phoneme cluster   1          1      1
                 Number of created clusters          7          7      7
                 Number of clusters per phoneme      1          1      1
10%              Correct classification              86%/100%   97%    100%
                 Multi-phoneme clusters formed       4/0        1      0
                 Phonemes in multi-phoneme cluster   4/1        2      1
                 Number of created clusters          7/10       21     7
                 Number of clusters per phoneme      4/2        4      1

Table 2 Comparative results of the three methods for the different noise levels.
In the case of the SOLNN, for 10% noise, two sets of results are noted, the first
corresponding to a 7-discriminator network and the second to a 10-discriminator
network. In the case of statistical methods, the results are obtained using the
Hamming distance metric.
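
The five criteria can be computed from a list of true phoneme labels and the cluster assigned to each pattern, as in the sketch below. One point is an assumption: the article does not say how a cluster comes to "represent" a phoneme class, so here a cluster is taken to represent its most frequent class. The function name, the dictionary keys and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def clustering_criteria(labels, clusters):
    """Compute criteria (i)-(v) from true class labels and cluster ids."""
    members = defaultdict(list)  # cluster id -> class labels inside it
    for label, cluster in zip(labels, clusters):
        members[cluster].append(label)

    # Assumption: a cluster "represents" its most frequent class.
    correct = sum(Counter(m).most_common(1)[0][1] for m in members.values())
    classes_per_cluster = [len(set(m)) for m in members.values()]

    clusters_per_class = defaultdict(set)  # class label -> cluster ids used
    for label, cluster in zip(labels, clusters):
        clusters_per_class[label].add(cluster)

    return {
        "correct_pct": 100.0 * correct / len(labels),                      # (i)
        "multi_phoneme_clusters": sum(n > 1 for n in classes_per_cluster),  # (ii)
        "max_classes_in_cluster": max(classes_per_cluster),                # (iii)
        "clusters_created": len(members),                                  # (iv)
        "max_clusters_per_class": max(
            len(c) for c in clusters_per_class.values()
        ),                                                                 # (v)
    }


# Hypothetical toy result: cluster 2 mixes one "I" with the "M" patterns.
labels = ["A", "A", "I", "I", "I", "M", "M"]
clusters = [0, 0, 1, 1, 2, 2, 2]
result = clustering_criteria(labels, clusters)
print(result)
```

In the toy result one "I" pattern falls into the "M" cluster, so criterion (ii) reports one multi-phoneme cluster and criterion (v) reports that the "I" class is spread over two clusters.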

In the application of the SOLNN to the experiment-dataset, several network sizes
ranging from 7 to 70 discriminators are simulated in order to investigate the effect
of the network size on the SOLNN clustering performance. As has been shown
[5], the SOLNN requires several iterations before settling on a clustering result,
gradually separating each class. As the clustering task becomes more difficult, that
is as the pattern classes become less clearly separated, the number of required
iterations increases. For noise levels up to 7.5%, the SOLNN succeeds in separating
all phoneme classes for all network sizes by creating a single group for each class. In
the case of the 10% noise level, when the distance between the phoneme classes is
considerably reduced, the SOLNN requires a larger number of iterations to settle on
a clustering result. More specifically, the number of required iterations is of the order
of 20, 50, 150 and 500 for noise levels of 2.5, 5, 7.5 and 10%, respectively. For the
highest noise level (10%), more than one cluster is generated for some phoneme classes.
Misclassifications are occasionally observed, as some nodes have patterns assigned
to them from more than one phoneme class. Such multi-phoneme clusters occur in
particular when the number of nodes in the SOLNN is reduced. Characteristically,
for a 10% level of noise, multi-phoneme clusters occur for a 7-node but not for a
10-node network. The formation of multi-phoneme clusters for the 7-node SOLNN (see
Table 2) is due to the fact that the network uses two nodes for the “U”-phoneme
class, and is thus unable to form a cluster dedicated exclusively to the “O”-phoneme
class. It is worth noting that multi-phoneme clusters consist of phonemes which have
a relatively high degree of similarity. It has been reported [5] that for the optimal
clustering to be achieved there need to be more nodes in the network than there
are classes in the dataset. Indeed, this is confirmed by the results obtained with
the 10-node SOLNN. Still, even for the 7-node system, the SOLNN succeeds in
consistently clustering the majority of patterns in single-phoneme clusters.

For the HTN, each test pattern is input into the lower layer of the HTN. In contrast
to the SOLNN, a single pass of activation-flow from the lower to the upper layer,
at a pre-specified k value, reveals the patterns to which the test pattern becomes
clustered. Optimum clustering with an HTN requires finding one (or more) value of k
for which each test pattern becomes clustered with all the nodes of the upper layer
representing patterns of the same phoneme class but with no nodes representing
patterns of different phoneme classes. This has been established for noise levels up
to 7.5% for k values around 0.650. The range of k values which result in the optimum
clustering becomes narrower as the noise level increases. This can be explained by
the fact that as the noise level rises (i) lower k values are required for grouping
patterns of the same phoneme class, while at the same time, (ii) higher k values
are required for patterns of different phoneme classes not to be grouped together.
When the noise level reaches 10%, no value of k generating the optimum clustering
can be found; some dataset patterns fail to be clustered with all the nodes of the
upper layer representing patterns of the same phoneme class, while they are clustered
with nodes representing patterns of other phoneme classes. It is possible, however,
to achieve a sub-optimum clustering by raising the value of k, whereby the problem
of multi-phoneme clusters is avoided at the expense of multiple clusters for each
phoneme class. As shown in Table 2, the sub-optimum clustering performed by the
HTN for the 10% noise level is due to the network grouping together an “I” with an
“M”, which are the two phoneme patterns from different classes with the minimum
distance. Due to the fact that this distance is equal to the maximum distance within
any class (see Table 1), the structure of the pattern space justifies the sub-optimal
clustering produced by the HTN.

For the statistical analysis, the five agglomerative clustering techniques mentioned
in Section 2 have been used to cluster all patterns in the minimum possible number
of groups, while avoiding the creation of any multi-phoneme clusters. Both the
Hamming distance and the Euclidean distance were considered as distance measures.
In the case of the Hamming distance in the 80-dimensional space, which in
the case of binary patterns is equal to the square of the Euclidean distance, the
seven phoneme classes are consistently separated by all statistical methods. When
using the Euclidean distance metric, the seven phoneme classes are separated by
all statistical methods for all noise levels except for the hierarchical average cluster
method for the 10%-noise level dataset. In this case, for seven clusters, the phonemes
“A”, “O”, “M” and “U” are grouped together, “N” and “S” form an independent
cluster each, whilst “I” is split into four clusters. The minimum number of clusters,
in order to avoid multi-phoneme clusters, is fourteen.
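
The relation between the two metrics quoted above can be checked directly: for binary patterns every coordinate difference is 0 or ±1, so its square equals its absolute value and the squared Euclidean distance counts the differing positions. The two 8-bit patterns below are arbitrary illustrative values.

```python
def hamming(a, b):
    """Number of positions at which two binary patterns differ."""
    return sum(x != y for x, y in zip(a, b))


def squared_euclidean(a, b):
    """Sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


a = [1, 0, 1, 1, 0, 0, 1, 0]
b = [1, 1, 0, 1, 0, 1, 1, 0]
print(hamming(a, b), squared_euclidean(a, b))
```

Because taking a square root preserves the ordering of distances, linkage rules that depend only on that ordering (single and complete linkage) merge in the same order under both metrics, whereas rules that average distances can behave differently, which may be why only an averaging method is reported to differ between the two metrics.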
5 Conclusions

Both the neural network (SOLNN and HTN) and the statistical techniques have
been found to perform the selected clustering task satisfactorily. For low noise
levels, all techniques cluster the dataset successfully, by forming exclusively
single-phoneme clusters. For higher noise levels, the statistical methods always
generate the optimum clustering according to the Hamming distance metric, as witnessed
by the study of the dataset structure. The quality of the clustering generated by the
two neural network models is slightly inferior to that of the statistical techniques.
This is indicated by the creation of multiple clusters for a few phoneme classes.
However, the vast majority of clusters consist of patterns from a single phoneme
class (see Table 2), thus producing successful clustering.

It is worth noting that the study of the Hamming distances between the different
phoneme patterns in the dataset indicates that the clustering behaviour of all
three techniques is justified. In particular, for the noise level of 10%, where the
greatest differences between the clustering results of the three methods are detected,
the minimum distance between the phoneme classes is equal to the maximum distance
between phonemes of the same class. This allows for more than one possible
clustering result of almost the same quality.
Pulsed oscillatory neural networks are examined for application to analysis and
segmentation of multispectral imagery from the Satellite Pour l’Observation de la Terre
(SPOT). These networks demonstrate a capacity to segment images with better performance
against many of the resolution uncertainty effects caused by local area adaptive
filtering. To enhance synchronous behavior, a reset mechanism is added to the model.
Previous work suggests that a reset activation pulse is generated by saccadic motor
commands. Consequently, an algorithm is developed which behaves similarly to adaptive
histogram techniques. These techniques appear both biologically plausible and
more effective than conventional techniques. Using the pulse time-of-arrival as the
information carrier, the image is reduced to a time signal which allows an intelligent
filtering using feedback.

1  Introduction 

Histogram image analysis may play an important role in biological vision image
processing. Structures of artificial neurons can be used to create histogram-like
signals quickly. In this paper, we will examine how algorithms based on fast histogram
processing may offer advantages for computer vision systems.

In a biological vision system, the signals detected at the retina are passed to the
LGN and then to the Visual Cortex, where neighborhood-preserving maps of the retina
are repeated at least 15 times over the surface of the cortex. For every path forward,
there are several neurological pathways in the reverse direction. Reverse direction
transmission of information suggests feedback signals. Using appropriate feedback,
the image processing can be controlled using recurrence. Pulses generated by
dynamic neuronal models suggest a method for building recurrence into vision models.

1.1  Dynamic  Neural  Networks 

Dynamic neural networks, first examined by Stephen Grossberg, Maas and others
[3], were an attempt to construct models closer to their biological counterpart. In
this model, the basic computational unit is not the usual static matrix multiplier
with a non-linear transfer function between layers, but a leaky integrator with a
pulsed oscillatory output. This work is sometimes referred to as Integrate and Fire
Networks.

To understand the role of synchrony in the cat visual cortex, Eckhorn devised a
model which would replicate some of this behavior. The Eckhorn [1] dynamic model
represents a visual neuron as a multi-input element with a single output. The output
is made up of oscillatory electrical pulses in time. Three types of connections make
up the input section: the feeding, the linking and the direct inputs. The direct
input provides pixel information. The feeding inputs provide local information in
the region of the direct inputs. The combination of the direct and feeding inputs
provides texture information in a local region around the direct input. The linking
field provides a global connection for all sub-regions in the image. The linking
connections enforce synchrony between local regions. The information is extracted
using correlation techniques.

Figure 1 Behavior of a clean and noisy signal.

Given these characteristics, several image processing functions are possible. In this
paper we will explore only one, usually the most difficult: segmentation. If
information in a particular part of the image is similar to information in another part
of the image, they should have the same pulse repetition frequency as the feeding
field, and the same phase after the linking fields. This information can easily be
extracted using correlation filters.

For a single neuron, $U_j(t)$ is given by:

=  +  (1) 

$X_j(t)$ are the direct inputs to a neuron, and $L_j(t)$ is the contribution from nearby
neurons, weighted by the link coefficient $\beta$. The spiking output, $Y_j$, is provided
by a step function of a time-dependent threshold $\theta_j(t)$ and the neuronal activity
according to

$$Y_j = \mathrm{step}\left[U_j(t) - \theta_j(t)\right] \qquad (2)$$

$\theta_j(t)$ and $L_j(t)$ can have different decay constants $\alpha_\theta$ and
$\alpha_L$. The dynamics of the linking field and threshold are given by

Lj(i)  =  ■  Yk{t)Mt)  =  (3) 

k 

where $w_{kj}$ is the connection weighting for the local field neurons. Sufficient neural
activity combined with a decaying threshold leads to a spike output, which resets the
threshold to the maximum from which it decays again.

Built into this neuronal model are a number of characteristics common to biological
vision: integrate and fire, latency and lateral interconnectivity. Figure 1 provides a
means to visualize the firing behavior of a one-dimensional signal in time. In the
first graph, the input to the system is 0.8. The energy is integrated and plotted. The
threshold, a decaying exponential, relaxes from its initial maximum. This ensures a
minimum pulse rate and a refractory period. When the two are equal, the neuron
fires and both $\theta_j(t)$, the threshold, and $U_j(t)$, the neuron potential, are
reset. In the second graph, a lower energy input is used, which is characterized by a
slower firing rate.
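
The firing behavior described above can be reproduced with a small discrete-time sketch of a single integrate-and-fire neuron. This is not the exact model of the paper: the leak factor, the threshold maximum, the decay constant and the update order are all illustrative assumptions, and only the input value 0.8 comes from the text.

```python
import math


def simulate_neuron(x, steps=200, theta_max=5.0, alpha=0.1, leak=0.95):
    """Return the time steps at which a single neuron fires for constant input x.

    All parameter values are illustrative assumptions, not taken from the paper.
    """
    u = 0.0          # neuron potential (leaky integral of the input)
    since_fire = 0   # steps elapsed since the last spike
    spikes = []
    for t in range(1, steps + 1):
        since_fire += 1
        u = leak * u + x                                   # leaky integration
        theta = theta_max * math.exp(-alpha * since_fire)  # decaying threshold
        if u >= theta:
            spikes.append(t)
            u = 0.0         # reset the potential
            since_fire = 0  # threshold returns to its maximum
    return spikes


strong = simulate_neuron(0.8)  # input level used in the first graph
weak = simulate_neuron(0.4)    # a lower, hypothetical input level
print(len(strong), len(weak))
```

The stronger input reaches the decaying threshold sooner after each reset, so it produces more spikes in the same time window, which is the rate coding described in the text.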

For large clusters of neurons, the group behavior can be characterized by the total
firing rates over time. A signal based on the total number of neurons firing at a
specific time could be used as a control signal for modification of the structure and
appearance of the image. This is the basis for using pulse-coupled neural networks
to segment images.

In the following graphs, only the firing patterns are shown. A one-dimensional signal
is used to demonstrate the temporal encoding nature of the output signal. The first
image, without coupling between neurons, shows how the pulse frequency rate
is related to the input strength.

In the second figure, coupling is added between adjacent neurons. The effect is to
cluster neurons into separate time bins. By evaluating the behavior of the temporal
signal, processing is performed on the actual image.


Figure  2 


2  Pulse  Coupled  Neurons  for  Adaptive  Histogram  Analysis  of 
Imagery 

Most dynamic models of neural computation suggest that synchronization between
regions is the primary information carrier. But if time of arrival is used as the
information carrier, it demonstrates that mechanisms exist in the cellular architectures
to perform histogram analysis and adaptive histogram analysis of images.
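
The link between time of arrival and a histogram can be sketched as follows: if every pixel drives an uncoupled neuron, pixels of equal intensity fire at the same time step, so counting the neurons that fire at each step yields a histogram of the image intensities along the time axis. The threshold parameters and the pixel values are illustrative assumptions, and coupling between neurons is deliberately left out.

```python
import math
from collections import Counter


def first_firing_time(x, theta_max=5.0, alpha=0.1, max_steps=1000):
    """First step at which the integrated input x reaches the decaying threshold."""
    u = 0.0
    for t in range(1, max_steps + 1):
        u += x
        if u >= theta_max * math.exp(-alpha * t):
            return t
    return None  # never fired within max_steps


def arrival_histogram(pixels):
    """Count how many pixels fire at each time step."""
    return Counter(first_firing_time(p) for p in pixels)


pixels = [0.8, 0.8, 0.4, 0.4, 0.4]  # hypothetical intensities of two regions
histogram = arrival_histogram(pixels)
print(sorted(histogram.items()))
```

Each peak of the resulting time signal corresponds to one group of equally bright pixels, with brighter groups arriving earlier.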

The following 32 x 32 image chips are taken from the SPOT satellite imagery and
represent common problems in image segmentation.

Several variations of edge detection are currently used on these types of images, the
most common being the Sobel and Canny operators. The difficulty lies in that the
size of the window used for the operator affects the size of the detectable gradients,
and the width of the edge. An improvement on the Sobel operator is the Canny
edge detector, which effectively finds the center of the gradient and places an edge
there. Its weakness, as with most convolution filter segmentation methods, is that
overall resolution is usually reduced to the size of the filter window. This effect
results in rounded and wide edges. In low resolution systems like SPOT, this loss
of resolution is often unacceptable.
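
The widening effect of a window operator can be seen on a synthetic example. The sketch below applies the standard 3 x 3 horizontal Sobel kernel to a made-up image containing a perfectly sharp vertical step; the image and the function name are assumptions for illustration, and border pixels are simply left at zero.

```python
SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]


def sobel_x(image):
    """Horizontal Sobel response at every interior pixel of a 2-D list image."""
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            out[r][c] = sum(
                SOBEL_X[i][j] * image[r + i - 1][c + j - 1]
                for i in range(3)
                for j in range(3)
            )
    return out


# A perfectly sharp vertical step between columns 2 and 3.
step_image = [[0, 0, 0, 1, 1, 1] for _ in range(5)]
response = sobel_x(step_image)
print(response[2])
```

Although the step lies between two adjacent columns, the operator responds in both neighbouring columns, so the detected edge is two pixels wide; larger windows spread the response further.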
Figure 3

An important advantage of the pulse-coupled methods is that segmentation is
possible for objects smaller than the window size. In the above image a road is
detected that is less than two pixels wide in some places. General threshold
settings for the previous problem are not compatible with detection of features in
the second image.

The  adaptive  nature  of  the  algorithm  allows  pulses  in  the  histogram  to  represent 
whole  regions  in  the  segmented  image.  The  signal  can  be  used  to  modify  the  image 
on  the  next  cycle. 


Figure  4 


This image was segmented as part of a larger project to perform terrain
categorization using a linear approximation to the Eckhorn model developed by Freyss.
For a relatively complex image such as this, most of the common techniques perform
relatively poorly. The advantage with this new technique is that there are almost
no parameters to be set and no buried decisions being made by the operator.

3  Conclusions  and  Future  Work 

The  model  we  developed  demonstrates  effective  image  segmentation  over  a  wide 
variety  of  image  class  types.  Although  many  of  the  results  demonstrated  here  could 
be  duplicated  using  standard  techniques,  these  methods  offer  a  simple  modular 
approach  to  the  image  analysis,  and  are  easily  implemented  in  silicon  devices. 

Our work suggests that the group behavior of clusters of neurons provides a
technique which encodes images into one-dimensional signals in time. Using the
temporally encoded group output as a control signal will add a large measure of
robustness to a visual system.

An important aspect of this work is the new approach to image processing; that is,
expectation filtering for particular characteristics with feedback to enhance desired
qualities in the signal. Donaho's [2] work on histogram equalization approximation
for manufacturing processes could easily be implemented using this architecture.
The Reasoning Neural Network (RN) has a learning algorithm belonging to the
weight-and-structure-change category, because it puts only one hidden node in the
initial network structure, and will recruit and prune hidden nodes during the learning
process. Empirical results show that learning of the RN is guaranteed to be completed,
that the number of required hidden nodes is reasonable, that the speed of learning is
much faster than that of back propagation networks, and that the RN is able to develop
a good internal representation.

1  Introduction 

Intuitively, human learning consists of cramming and reasoning at a high level
of abstraction [5]. This observation has suggested a learning algorithm as shown
in Figure 1. This learning algorithm belongs to the weight-and-structure-change
category, because it puts only one hidden node initially, and will recruit and prune
hidden nodes during the learning process. Our learning algorithm is guaranteed to
achieve the goal of learning perfectly. There are some similar learning algorithms;
however, most of them have more complex pruning strategies [7, 4, 1, 2].

2  The  RN’s  Network  Architecture 

The RN adopts the layered feedforward network structure. Suppose that the
network has three layers with $m$ input nodes at the bottom, $p$ hidden nodes, and
$q$ output nodes at the top. Let $B_c \in \{-1, 1\}^m$ be the $c$th given stimulus input,
$b_{cj}$ the stimulus value received in the $j$th input node when $B_c$ is presented to the
network, $w_{ij}$ the weight of the connection between the $j$th input node and the $i$th
hidden node, $\theta_i$ the negative of the threshold value of the $i$th hidden node,
$W_i = (w_{i1}, w_{i2}, \ldots, w_{im})^t$ the vector of weights of the connections between all
input nodes and the $i$th hidden node, where the superscript $t$ indicates the
transposition, $X_i^t = (\theta_i, W_i^t)$, and $X^t = (X_1^t, X_2^t, \ldots, X_p^t)$. Then, given the
stimulus $B_c$, the activation value of the $i$th hidden node is computed:

$$h(B_c, X_i) = \tanh\Big(\theta_i + \sum_{j=1}^{m} w_{ij}\, b_{cj}\Big).$$

Let $h(B_c, X) = (h(B_c, X_1), h(B_c, X_2), \ldots, h(B_c, X_p))^t$ be the activation value
vector of all hidden nodes when $B_c$ is presented to the network, $r_{li}$ the weight of the
connection between the $i$th hidden node and the $l$th output node,
$r_l = (r_{l1}, r_{l2}, \ldots, r_{lp})^t$ the vector of weights of the connections between all
hidden nodes and the $l$th output node, $s_l$ the negative of the threshold value of the
$l$th output node, $Y_l^t = (s_l, r_l^t)$, $Y^t = (Y_1^t, Y_2^t, \ldots, Y_q^t)$, and $Z^t = (Y^t, X^t)$.
The activation value of the $l$th output node is computed after $h(B_c, X)$:

$$O(B_c, Y_l, X) = \tanh\Big(s_l + \sum_{i=1}^{p} r_{li}\, h(B_c, X_i)\Big).$$
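The two activation formulas of this section can be sketched in a few lines of NumPy. This is only an illustration of the forward pass under the definitions above; the function names, the array layout (one stimulus per row), and the toy weights are assumptions made for the example and do not come from the paper.

```python
import numpy as np

def hidden_activations(B, theta, W):
    """h(B_c, X_i) = tanh(theta_i + sum_j w_ij * b_cj) for every stimulus c and node i.

    B:     (n, m) array, one bipolar stimulus B_c per row
    theta: (p,)   negatives of the hidden thresholds
    W:     (p, m) input-to-hidden weights, W[i, j] = w_ij
    """
    return np.tanh(theta + B @ W.T)

def output_activations(B, theta, W, s, R):
    """O(B_c, Y_l, X) = tanh(s_l + sum_i r_li * h(B_c, X_i)).

    s: (q,)   negatives of the output thresholds
    R: (q, p) hidden-to-output weights, R[l, i] = r_li
    """
    H = hidden_activations(B, theta, W)
    return np.tanh(s + H @ R.T)

# Toy network (hypothetical values): m = 2 inputs, p = 1 hidden node, q = 1 output.
B = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]])
theta = np.array([-1.0])
W = np.array([[1.0, 1.0]])
s = np.array([0.0])
R = np.array([[2.0]])

H = hidden_activations(B, theta, W)        # shape (4, 1)
O = output_activations(B, theta, W, s, R)  # shape (4, 1)
```

Storing stimuli as rows lets one matrix product evaluate all $n$ stimuli at once, which is convenient for the checks on the whole training set used later.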


3  The  Learning  Algorithm 

The block diagram of the learning algorithm is shown in Figure 1. There are four
distinguishing aspects of this learning algorithm: the linearly separating condition
(LSC), the thinking mechanism, the cramming mechanism, and the reasoning
mechanism. These aspects are explained in the following.

Figure 1 The block diagram of the learning algorithm. The details of the thinking
box, the reasoning box and the cramming box are shown in Figure 2 and Figure 3.

[Figure 2 panel titles: The Generalized Delta Rule Part; The Reasoning Part]

Figure 2 The Generalised Delta Rule part, the thinking part and the reasoning
part. The values of the given constants $\alpha$ and $\omega$ in the Generalised Delta Rule
part are tiny.

[Figure 3 content (the cramming part): the box increments $p$ by one, then adds the
new hidden node with $W_p = B_k$, $\theta_p = 1 - m$, and $r_{lp} = 0$ for every $l \in L$;
for every $l \notin L$, $r_{lp}$ is given by a two-case expression (one case for
$d_{kl} = 1.0$, one for $d_{kl} = -1.0$) built from
$\sum_i r_{li}\, h(B_u, X_i) - \sum_i r_{li}\, h(B_k, X_i)$, with a minimum over $u$ in the
second case.]

Figure 3 The Cramming part. Suppose that $l \in L$ if, before implementing the
cramming mechanism, the LSC with respect to the $l$th output node is satisfied.

With respect to each output node, the network is used as a classifier which learns
to distinguish whether the stimulus is a member of one class of stimuli, called class 1,
or of a different class, called class 2, by being presented with exemplars of each class.
With respect to the $l$th output node, let $K = K_{l1} \cup K_{l2}$, where $K_{l1}$ and $K_{l2}$ are
the sets of indices of all given training stimuli in classes 1 and 2, respectively; and let
$d_{cl}$ be the desired output value of the $l$th output node for the $c$th stimulus, with 1.0
and $-1.0$ being respectively the desired output values of classes 1 and 2. Learning
seeks $Z$ where, for all $l$,

$$d_{cl}\, O(B_c, Y_l, X) > \nu \quad \forall c \in K \qquad (1)$$

and $0 < \nu < 1$. With respect to the $l$th output node, let the LSC be that

$$\min_{c \in K_{l1}} O(B_c, Y_l, X) > \max_{c \in K_{l2}} O(B_c, Y_l, X).$$
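A minimal sketch of why the LSC is a useful intermediate goal: once the two classes are separated at an output node, requirement (1) can be met by shifting and scaling that node's own weights $Y_l$ while the hidden layer stays untouched. The construction below (centre the net input between the classes, then raise the gain) is one simple way to do this and is not necessarily the construction used in [5]; the function names and the toy numbers are assumptions for illustration.

```python
import numpy as np

def lsc_satisfied(net, d):
    """LSC for one output node. tanh is increasing, so comparing net inputs
    is equivalent to comparing the output activations."""
    return net[d > 0].min() > net[d < 0].max()

def rescale_output_weights(H, r, s, d, nu):
    """Return (r_new, s_new) such that d * tanh(s_new + H @ r_new) > nu.

    H: (n, p) hidden activations, r: (p,) weights r_li, s: scalar s_l,
    d: (n,) desired outputs in {+1, -1}, nu: margin with 0 < nu < 1.
    """
    net = s + H @ r
    if not lsc_satisfied(net, d):
        raise ValueError("LSC is not satisfied; adjusting Y_l alone cannot help")
    lo, hi = net[d < 0].max(), net[d > 0].min()
    mid, half_gap = 0.5 * (lo + hi), 0.5 * (hi - lo)
    gain = 1.05 * np.arctanh(nu) / half_gap   # 5% headroom keeps the inequality strict
    return gain * r, gain * (s - mid)

# Hypothetical hidden activations for four stimuli (two per class).
H = np.array([[0.2, 0.1], [0.3, -0.1], [0.1, 0.0], [-0.1, 0.2]])
r, s = np.array([1.0, 0.5]), 0.05
d = np.array([1, 1, -1, -1])
nu = 0.9

before = d * np.tanh(s + H @ r)
r_new, s_new = rescale_output_weights(H, r, s, d, nu)
after = d * np.tanh(s_new + H @ r_new)
```

In this toy case all four outputs are positive before rescaling, so (1) fails for class 2 although the LSC holds; after rescaling every product $d_{cl}\,O$ exceeds $\nu$.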

When the LSC with respect to the $l$th output node is satisfied, the requirement (1)
with respect to the $l$th output node can be achieved by merely adjusting $Y_l$ [5].
At the learning stage, the training stimuli are presented one by one. At the $k$th
given stimulus, the objective function is defined as:

$$E(Z) = \sum_{c=1}^{k} \sum_{l=1}^{q} \big(d_{cl} - O(B_c, Y_l, X)\big)^2$$

and let $K(k) = \{1, \ldots, k\}$ and $K(k) = K_{l1}(k) \cup K_{l2}(k)$, where $K_{l1}(k)$ and $K_{l2}(k)$
are, respectively, the sets of indices of the first $k$ training stimuli in classes 1 and 2,
with respect to the $l$th output node. Then the thinking mechanism is implemented,
in which the momentum version of the generalized delta rule (with automatic
adjustment of learning rate) is adopted. Learning might converge to the bottom of a
very shallow steep-sided valley [3], where the magnitude of the gradient will be tiny
and the consecutive adaptive learning rates will also be tiny. Therefore, as shown in
the generalized delta rule part of Figure 2, these two criteria are adopted to detect
whether the learning has hit the neighborhood of an undesired attractor.

The desired solution is required neither to satisfy the requirement (1) nor to
be a stationary point at which $\nabla_Z E(Z) = 0$. Thus, the magnitude of $\|\nabla_Z E(Z)\|$
before hitting the desired solution is not necessarily tiny, and the learning time is
rather shorter than with conventional stopping criteria (for example, small $E(Z)$
or $\|\nabla_Z E(Z)\| = 0$).

The thinking mechanism does not guarantee that the LSC with respect to all output
nodes will be satisfied. Two ideas could render the learning capable of escaping
from the undesired attractor: adding a hidden node and altering the objective function.
Adding a hidden node increases the dimension of the weight space, while
altering the objective function changes its surface over the weight space
such that the trapping attractor may no longer be an attractor. Both ideas are
implemented in our learning algorithm: the first via the cramming mechanism, and
the second by introducing a new training stimulus, which alters the objective function.
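As a small worked check of the hidden node that Figure 3 adds, take the values stated there, $W_p = B_k$ and $\theta_p = 1 - m$, together with the tanh activation of Section 2. For bipolar stimuli the net input is then $1 - 2\,d_H$, where $d_H$ is the Hamming distance to $B_k$, so the node responds with $\tanh(1)$ to $B_k$ and with at most $\tanh(-1)$ to every other stimulus. The sketch below verifies this numerically; the choice $m = 4$ and the index `k` are arbitrary assumptions for the example.

```python
import numpy as np
from itertools import product

m = 4
stimuli = np.array(list(product([-1, 1], repeat=m)))  # all 2^m bipolar stimuli
k = 5                                                 # arbitrary stimulus to "cram"
B_k = stimuli[k]

# New hidden node as specified in Figure 3.
W_p = B_k.astype(float)
theta_p = 1.0 - m

# Net input equals 1 - 2 * (Hamming distance between each stimulus and B_k).
net = theta_p + stimuli @ W_p
h_new = np.tanh(net)

hamming = np.sum(stimuli != B_k, axis=1)
```

The new node therefore singles out exactly one stimulus, which is what allows the cramming step to fix the newest pattern without disturbing the others.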

The cramming mechanism can be viewed as follows: first a new hidden node with
the near-threshold activation function is added, then the softening mechanism of
[6] is used immediately to render the activation function of the new hidden node a
tanh one. The Lemma in [6] shows that the mechanism of adding a new hidden node
with the near-threshold activation function and a finite value of the gain parameter
can immediately render the LSC with respect to all output nodes satisfied, if the
training set has no internal conflicts (different outputs for the same input).
However, the number of required hidden nodes may be too large and the
generalization ability of the network may be bad. Thus it is necessary to adopt the
reasoning mechanism for the purpose of rendering the network more compact. The
reasoning mechanism includes the thinking mechanism and the pruning mechanism of
removing irrelevant hidden nodes. In a $Z$, the $i$th hidden node is said to be irrelevant
to the LSC with respect to the $l$th output node if the LSC is still satisfied with the
same $Z$ except $r_{li} = 0$; and a hidden node is irrelevant if it is irrelevant to the LSC
with respect to all output nodes [5].
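The definition of an irrelevant hidden node translates directly into a test: zero the connection $r_{li}$ for each output node $l$ in turn and check whether the LSC still holds. The sketch below implements only this test, not the full reasoning mechanism (which also involves further thinking steps); the function names and the toy data are assumptions for illustration.

```python
import numpy as np

def lsc_holds(net, d):
    """LSC for one output node, checked on net inputs (tanh is increasing)."""
    return net[d > 0].min() > net[d < 0].max()

def irrelevant_hidden_nodes(H, R, s, D):
    """Indices i of hidden nodes whose removal (r_li = 0) keeps the LSC
    satisfied with respect to every output node l.

    H: (n, p) hidden activations, R: (q, p) weights r_li,
    s: (q,) negatives of output thresholds, D: (n, q) desired outputs (+1/-1).
    """
    q, p = R.shape
    irrelevant = []
    for i in range(p):
        still_separated = True
        for l in range(q):
            r = R[l].copy()
            r[i] = 0.0                      # same Z except r_li = 0
            if not lsc_holds(s[l] + H @ r, D[:, l]):
                still_separated = False
                break
        if still_separated:
            irrelevant.append(i)
    return irrelevant

# Toy case: node 0 carries the class information, node 1 barely contributes.
H = np.array([[0.9, 0.1], [0.8, -0.2], [-0.9, 0.3], [-0.7, 0.0]])
R = np.array([[1.0, 0.05]])
s = np.array([0.0])
D = np.array([[1], [1], [-1], [-1]])

prunable = irrelevant_hidden_nodes(H, R, s, D)
```

Here only node 1 is reported, since zeroing node 0's connection destroys the separation between the two classes.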

4  The  Performance  of  the  RN 

We report three experiments. In each simulation, there are 100 testing cases, each
with a different input sequence of training stimuli. One experiment is the $m$-bit
parity learning problem. In Figure 4a, the numbers of hidden nodes used during
the 6-bit parity learning process are plotted. The variance of $p$ is due to the different
input sequences of training stimuli. Figure 4b shows the summary of the simulation
results, and it shows that the average value of $p$ for the $m$-bit parity problems is
merely a little bigger than $m$. However, it is surprising to see that the RN can solve
the $m$-bit parity problems with fewer than $m$ hidden nodes.

The output/hidden experiment is used to identify the relationship between the
number of required hidden nodes and the number of output nodes used. The number
of input nodes is fixed at 8. The training stimuli and their input sequence are
randomized, but the number of output nodes used is varied from 1 to 6. In Figure
4c, the simulation results of the "$m = 8$, $q = 3$, and $K = 100$" problem are plotted.
Figure 4d shows the summary of the simulation results, and it shows that the
relationship between the average value of $p$ for the RN and the value of $q$ is roughly
linear.

One significant phenomenon in the above simulations is that the value of $q$ influences
the value of $p$ more significantly than the values of $K$ and $m$ do. Another interesting
phenomenon is that if there is no correlation within the current given training stimuli,
the RN tends to cram them (in other words, memorize them individually) by using
many more hidden nodes. But when there is correlation within the current given
training stimuli, it seems that the RN will figure out a smart way to classify the
given training stimuli by using fewer hidden nodes. In other words, the RN has the
ability to develop a good internal representation for the given training stimuli.

Figure 4 a) The simulation result of the 6-bit parity problem. b) The summary
of simulation results of the $m$-bit parity problem. c) The simulation result of the
"$m = 8$, $q = 3$, and $K = 100$" problem. d) The summary of simulation results of
the output/hidden problems. e) The simulation result of the 5-p-5 problem.

The third experiment is the 5-p-5 problem (Figure 4e), in which the training stimuli
are the same as those of the 5-bit parity problem and the desired output vector is
the same as the stimulus input vector. Somewhat surprisingly, as shown in Figure
5, each hidden node of one final RN has only one strong connection from the
input nodes, and each output node has only one strong connection from the
hidden nodes. In addition, different hidden nodes have their strongest connections
from different input nodes, different output nodes have their strongest connections
from different hidden nodes, and the signs of connected strongest connections are the
same. It seems that, after learning the full set of training stimuli, the RN had
learned to use the hidden nodes to bypass the input stimuli, rather than to encode
them.

Figure 5 One final RN of the 5-p-5 encoder problem, where we show only the
connections with weights of large magnitude, and the signs of their weights. Note
that the signs of two connected strongest connections are the same.

5  Discussions  and  Future  work 

The empirical results show that the learning of the RN is much faster than the
back propagation learning algorithm, and that the RN is able to develop good
internal representation with good generalization. The RN has flexibility in its learning
algorithm; different algorithms have been obtained by integrating the prime
mechanisms in different ways. These yield different simulation results: it seems that
the mechanisms of the RN should be managed differently for different application
problems.

The storage capacity of multilayer networks with overlapping receptive fields is
investigated for a constructive algorithm within a one-step replica symmetry breaking
(RSB) treatment. We find that the storage capacity increases logarithmically with the
number of hidden units $K$ without saturating the Mitchison-Durbin bound. The slope
of the logarithmic increase decays exponentially with the stability with which the
patterns have been stored.

1  Introduction 

Since the groundbreaking work of Gardner [1] on the storage capacity of the
perceptron, the replica technique of statistical mechanics has been successfully used
to investigate many aspects of the performance of simple neural network models.
However, progress for multilayer feedforward networks has been hampered by
the inherent difficulties of the replica calculation. This is especially true for
capacity calculations, where replica symmetric (RS) treatments [2] violate the
Mitchison-Durbin upper bound [3] derived from information theory. Other efforts [4]
break the symmetry of the hidden units explicitly prior to the actual calculation, but
the resulting equations are approximations and difficult to solve for large networks.
This paper avoids these problems by addressing the capacity of a class of networks
with variable architecture produced by a constructive algorithm. In this case,
results derived for simple binary perceptrons above their saturation limit [5] can be
applied iteratively to yield the storage capacity of two-layer networks.

Constructive algorithms (e.g., [6, 8]) are based on the idea that in general it is a
priori unknown how large a network must be to perform a certain classification
task. It seems appealing therefore to start off with a simple network, e.g., a binary
perceptron, and to increase its complexity only when needed. This procedure has
the added advantage that the training time of the whole network is relatively short,
since each training step consists of training the newly added hidden units only,
whereas previously constructed weights are kept fixed. Although constructive
algorithms therefore seem rather appealing, their properties are not well understood.
The aim of this paper is to analyse the performance of one constructive algorithm,
the upstart algorithm [8], in learning random dichotomies, usually referred to as
the capacity problem.

The basic idea of the upstart algorithm is to start with a binary perceptron unit
with possible outputs {1, 0}. Further units are created only if the initial perceptron
makes any errors on the training set. One unit may have to be created to correct
WRONGLY ON errors (where the target was 0 but the actual output is 1), another to
correct WRONGLY OFF errors (where the target was 1 but the output is 0). If these
units still cause errors in the output of the network, more units are created in the
next generation of the algorithm until all outputs are correct. Different versions of
the upstart algorithm differ in the way new units are connected to the old units and
to the output unit. The original upstart algorithm produces a hierarchical network
where the number of hidden units tends to increase exponentially with each
generation. Other versions of the upstart algorithm [8] build a two-layer architecture
and show only a linear increase of the number of units with each generation, which
is in general easier to implement.

We have therefore analysed a non-hierarchical version of the upstart algorithm.
Within a one-step replica symmetry breaking (RSB) treatment [9], networks
constructed by the upstart algorithm show a logarithmic increase of the capacity with
the number of nodes, in agreement with the Mitchison-Durbin bound
($\alpha_c \propto \ln K / \ln 2$), whereas the simpler RS treatment violates this bound.
Furthermore, the algorithm does not saturate the Mitchison-Durbin bound for zero
stability. We further find that the slope of the logarithmic increase of the capacity
against network size decreases exponentially with the stability.

2  Model  Description  and  Framework 
2.1  Definition  of  the  Upstart  Algorithm 

The upstart algorithm first creates a binary perceptron (or unit) $\mathcal{D}_0$ which learns
a synaptic weight vector $W \in \mathbb{R}^N$ and a threshold $\theta$ which minimize the error on
a set of $p$ input-output mappings $\xi^\mu \in \{-1, 1\}^N \to \zeta^\mu \in \{0, 1\}$ ($\mu = 1, \ldots, p$)
from an $N$-dimensional binary input space to binary targets. The output of the binary
perceptron is determined by

$$\sigma^\mu = \Theta\Big(\frac{1}{\sqrt{N}}\, W \cdot \xi^\mu - \theta\Big) \equiv \Theta(h^\mu)$$

where $\Theta(x)$ is the Heaviside step function, which is 1 for $x > 0$ and 0 otherwise,
and $h^\mu$ is the activation of the perceptron. The error is defined as

$$E = \sum_\mu \Theta\big[\kappa - (2\zeta^\mu - 1)\, h^\mu\big],$$

where $\kappa$ is the stability with which we require the patterns to be stored. A suitable
algorithm (e.g., [10]) will converge to a set of weights $W$ which minimizes the above
error. If the set of examples is not linearly separable with a minimum distance $\kappa$
of all patterns to the hyperplane, the binary perceptron will not be able to classify
all patterns correctly, i.e., $\sigma^\mu \neq \zeta^\mu$ for some $\mu$'s, and the upstart algorithm has to
create further daughter units in a hidden layer to realize the mapping. The upstart
algorithm therefore creates a binary {0, 1} output unit $O$ with threshold one, and
the initial perceptron $\mathcal{D}_0$ and all further daughter units to be built by the algorithm
will form the hidden layer. The first perceptron is then connected to $O$ with a +1
weight, i.e., $O$ has initially the same outputs as $\mathcal{D}_0$.
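The error with stability $\kappa$ counts the patterns whose activation lies on the wrong side of the hyperplane or closer to it than $\kappa$. A minimal sketch follows, assuming the activation is normalised as $h^\mu = W \cdot \xi^\mu / \sqrt{N} - \theta$ and that $\Theta(x) = 1$ only for $x > 0$; the function names and the toy patterns are assumptions for illustration.

```python
import numpy as np

def activations(W, theta, xi):
    """h^mu = W . xi^mu / sqrt(N) - theta for each pattern (row of xi)."""
    return xi @ W / np.sqrt(len(W)) - theta

def stability_error(W, theta, xi, zeta, kappa):
    """E = sum_mu Theta[kappa - (2 zeta^mu - 1) h^mu], with Theta(x) = 1 for x > 0."""
    h = activations(W, theta, xi)
    return int(np.sum(kappa - (2 * zeta - 1) * h > 0))

# Toy example with N = 4 and two patterns.
W = np.ones(4)
theta = 0.0
xi = np.array([[1, 1, 1, 1],
               [1, -1, -1, -1]])
zeta = np.array([1, 0])

h = activations(W, theta, xi)                           # [2.0, -1.0]
errors_k0 = stability_error(W, theta, xi, zeta, 0.0)    # both patterns stored
errors_k15 = stability_error(W, theta, xi, zeta, 1.5)   # second pattern too close
```

The factor $(2\zeta^\mu - 1)$ maps the {0, 1} targets to {-1, +1}, so the product with $h^\mu$ is positive exactly when the pattern is classified correctly.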

The basic idea of the upstart algorithm is to create further daughter units $\mathcal{D}^{+}$
and $\mathcal{D}^{-}$ in the hidden layer to correct WRONGLY OFF and WRONGLY ON errors
respectively. Consider, for example, the creation of the new hidden unit $\mathcal{D}^{-}$, which
is connected with a large negative weight to $O$, whose role is to inhibit $O$. $\mathcal{D}^{-}$
should be active (1) for patterns for which $O$ was WRONGLY ON and inactive (0)
for patterns for which $O$ was CORRECTLY ON. Similarly, $\mathcal{D}^{-}$ ought to be 0 if $O$
was WRONGLY OFF, in order to avoid further inhibition of $O$. However, we do not
have to train $\mathcal{D}^{-}$ on patterns for which $O$ was CORRECTLY OFF, since an active $\mathcal{D}^{-}$
would only reinforce $O$'s already correct response. The resulting training sets and
the targets of both daughter units are illustrated in Table 1.

Output of O    Case             Target of D+    Target of D-
sigma = 1      CORRECTLY ON     *               0
sigma = 1      WRONGLY ON       0               1
sigma = 0      WRONGLY OFF      1               0
sigma = 0      CORRECTLY OFF    0               *

Table 1 The targets of the upstart II algorithm depending on the requested target
$\zeta$ and the actual output $\sigma$ of the output unit $O$. The target "*" means that the
pattern is not included in the training set of $\mathcal{D}^{\pm}$.

More formally we define the algorithm upstart II by the following steps, which are
applied recursively until the task is learned:

Step 0: Follow the above procedure for the original unit $\mathcal{D}_0$ and the creation of
the output unit $O$. Evaluate the number of WRONGLY OFF and WRONGLY ON
errors.

Step 1: If the output unit $O_i$ of the upstart network of $i$ generations makes more
WRONGLY OFF than WRONGLY ON errors, a new unit $\mathcal{D}^{+}_{i+1}$ is created and
trained on the training set and targets given in Table 1. If there are more
WRONGLY ON than WRONGLY OFF errors, a new unit $\mathcal{D}^{-}_{i+1}$ is created with
training set and targets also given in Table 1. If both kinds of errors occur
equally often, two units $\mathcal{D}^{+}_{i+1}$ and $\mathcal{D}^{-}_{i+1}$ are created with training sets and
targets as above.

Step 2: The new units are trained on their training sets and their weights are
frozen. The units $\mathcal{D}^{+}_{i+1}$ and $\mathcal{D}^{-}_{i+1}$ are then connected with positive and negative
weights to the output unit, respectively. The moduli of the weights are adjusted so that
$\mathcal{D}^{\pm}_{i+1}$ overrules any previous decisions if active. The total number of WRONGLY
OFF and WRONGLY ON errors of the upstart network of generation $i + 1$ is then
re-evaluated. If the network still makes errors the algorithm goes back to Step 1.

The algorithm will eventually converge, as a daughter unit will always be able to
correct at least one of the previously misclassified patterns without upsetting any
already correctly classified examples.
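The steps above can be sketched as a short program. This is an illustrative simplification, not the authors' implementation, and several choices are assumptions made only for the example: the daughter units are trained with a pocket perceptron; when both error types are equally frequent only a $\mathcal{D}^{+}$ unit is added (the paper adds both); output weights of modulus $2^g$ for generation $g$ are one way to let the newest unit overrule earlier ones; and if a trained daughter fixes nothing or fires on a target-0 pattern, it is replaced by a unit that fires on exactly one misclassified pattern, which makes the convergence argument explicit.

```python
import numpy as np
from itertools import product

def step(x):
    """Heaviside step: 1 for x > 0, else 0."""
    return (np.asarray(x) > 0).astype(int)

def train_perceptron(X, t, epochs=50):
    """Pocket perceptron (assumed trainer): keep the weights with fewest errors."""
    w, theta = np.zeros(X.shape[1]), 0.0
    best_w, best_theta, best_err = w.copy(), theta, len(t) + 1
    for _ in range(epochs):
        for x, target in zip(X, t):
            delta = target - int(w @ x - theta > 0)
            w = w + delta * x
            theta = theta - delta
        err = int(np.sum(step(X @ w - theta) != t))
        if err < best_err:
            best_w, best_theta, best_err = w.copy(), theta, err
    return best_w, best_theta

def network_output(X, units):
    """units: list of (w, theta, output_weight); O thresholds the weighted sum."""
    total = sum(a * step(X @ w - theta) for w, theta, a in units)
    return step(total - 0.5)

def upstart(X, zeta, max_generations=100):
    n = X.shape[1]
    w, theta = train_perceptron(X, zeta)
    units = [(w, theta, 1.0)]                        # Step 0: initial unit, weight +1
    for g in range(1, max_generations + 1):
        sigma = network_output(X, units)
        wrongly_off = (zeta == 1) & (sigma == 0)
        wrongly_on = (zeta == 0) & (sigma == 1)
        if not (wrongly_off.any() or wrongly_on.any()):
            break
        plus = bool(wrongly_off.sum() >= wrongly_on.sum())  # ties -> D+ (simplification)
        wrong = wrongly_off if plus else wrongly_on
        # Table 1: drop the '*' patterns (CORRECTLY ON for D+, CORRECTLY OFF for D-).
        keep = ~((sigma == zeta) & (zeta == int(plus)))
        targets = wrong[keep].astype(int)
        w, theta = train_perceptron(X[keep], targets)
        fired = step(X[keep] @ w - theta)
        if np.any(fired > targets) or not np.any(fired & targets):
            mu = int(np.flatnonzero(wrong)[0])       # fallback: unit for one pattern only
            w, theta = X[mu].astype(float), n - 1.0
        units.append((w, theta, (1.0 if plus else -1.0) * 2.0 ** g))
    return units

# 3-bit parity on bipolar inputs: target 1 when the number of +1 inputs is odd.
X = np.array(list(product([-1, 1], repeat=3)))
zeta = (np.sum(X == 1, axis=1) % 2).astype(int)
units = upstart(X, zeta)
sigma = network_output(X, units)
```

With these choices every generation removes at least one error and introduces none, so the loop stops after at most as many generations as there are initially misclassified patterns.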

2.2 Statistical Mechanics Framework for Calculating the Capacity Limit

Since the upstart algorithm trains only perceptrons, we can apply knowledge of
the capacity limit and of the error rate of perceptrons above saturation derived in
a statistical mechanics framework to calculate the capacity limit of the upstart II
algorithm for an arbitrary number of generations. Below, we briefly review this
statistical mechanics calculation and refer the reader to [5] and to previous work [1]
for a more detailed treatment.

In the capacity problem the aim is to find the maximum number $p$ of random
input-output mappings of binary $N$-dimensional input vectors $\xi^\mu$ to targets
$\zeta^\mu \in \{0, 1\}$ which can be realized by a network on average. We assume that each
component of the input vectors $\xi^\mu$ is drawn independently with equal probability
from $\{-1, 1\}$. The distribution of targets is taken to be pattern independent with a
possible bias $b$: $P(\zeta) = \frac{1}{2}(1 + b)\,\delta(1 - \zeta) + \frac{1}{2}(1 - b)\,\delta(\zeta)$. We will here only
consider an unbiased output distribution for the initial perceptron. The target
distributions for daughter units, however, will in general be biased.

Each binary perceptron is trained stochastically and we only allow weight vector
solutions with the minimal achievable error. The error rate, i.e., the number of
errors divided by the total number of examples, is assumed to be self-averaging with
respect to the randomness in the training set in the thermodynamic limit $N \to \infty$.
In this limit the natural measure for the number of examples $p$ is $\alpha = p/N$. With
increasing $\alpha$ the weight space of possible solutions shrinks, leaving a unique
solution at the capacity limit of the binary perceptron. Above the capacity limit many
different weight space solutions with the same error are possible. In general the
solution space will be disconnected, as two solutions can possibly misclassify different
patterns. As $\alpha$ diverges, the solution space becomes increasingly fragmented.

The replica trick is used to calculate the solution space and the minimal error rate
averaged over the randomness of the training set. This involves the replication of
the perceptron weight vector, each replica representing a different possible solution
to the same storage problem. In order to make significant progress, one has further
to assume some kind of structure in the replica space. Below the capacity limit,
the connectedness of the solution space is reflected by the correctness of a replica
symmetric (RS) ansatz. Above the capacity, the disconnectedness of the solution
space breaks the RS to some degree. We have restricted ourselves to a one-step
replica symmetry breaking (RSB) calculation, which is expected to be at least
sufficient for small error rates. The equations for the error rate resulting
from the RS and one-step RSB calculations are quite cumbersome and will be
reported elsewhere [5, 11]. For the perceptron, the error rate is a function of the
output bias $b$ and the load $\alpha$ only.

3  Results  of  the  Upstart  Algorithm 

The capacity of an upstart network with $K$ hidden units can now be calculated.
The initial perceptron is trained with an example load of $\alpha$ and an unbiased output
distribution $b = 0$. The saddlepoint equations and the WRONGLY ON and WRONGLY
OFF error rates are calculated numerically. These error rates determine the load
and bias for the unit(s) to be created in the next generation. Now its (their) error
rates and the errors of the output unit can in turn be calculated by solving the
saddlepoint equations. This is iterated until $K$ units have been built. If the output
unit still makes errors, we are above the capacity limit of the upstart net with $K$
hidden units and $\alpha$ has to be decreased. On the other hand, if the output unit makes
no errors, $\alpha$ can be increased. The maximal $\alpha$ for which the output unit makes no
errors defines the saturation point of the network. The capacity limit, defined here
as the maximal number of examples per adjustable weight of the network, then
becomes simply $\alpha_c(K) = \alpha/K$.

Figure 1 (a) Within the one-step RSB theory, the capacity $\alpha_c$ increases
logarithmically with the number of hidden units $K$ for large $K$ for the stabilities
$\kappa = 0$ (0.1), i.e., $\alpha_c \propto 0.3595$ (0.182) $\ln K$ (see superimposed asymptotics). The RS
theory violates the Mitchison-Durbin bound (third asymptotic: $\alpha_c \propto \ln K / \ln 2$) for
$K > 180$. (b) The slope $\gamma$ of the logarithmic increase of the capacity decreases
exponentially with the stability $\kappa$.
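The search for the saturation point described here is a one-dimensional bracketing search in $\alpha$. The sketch below shows that outer loop only. The saddlepoint equations are not given in this paper, so `makes_errors` is a hypothetical stand-in for "build $K$ units at load $\alpha$ and test whether the output unit still errs", and the toy predicate used to exercise the code is an arbitrary assumption with no relation to the reported results. Bisection also assumes that errors appear monotonically as $\alpha$ grows.

```python
import math

def saturation_load(makes_errors, K, lo=0.0, hi=50.0, tol=1e-6):
    """Largest load alpha for which a K-unit network makes no output errors.

    makes_errors(alpha, K) -> bool is a placeholder for the numerical solution
    of the saddlepoint equations over K generated units.
    Assumes no errors at `lo`, errors at `hi`, and monotonic behaviour between.
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if makes_errors(mid, K):
            hi = mid          # above capacity: decrease alpha
        else:
            lo = mid          # below capacity: alpha can be increased
    return lo

def capacity_per_weight(makes_errors, K):
    """alpha_c(K) = alpha / K at the saturation point."""
    return saturation_load(makes_errors, K) / K

# Purely illustrative predicate (NOT the paper's equations):
# errors appear once alpha exceeds K * ln(1 + K).
def toy_makes_errors(alpha, K):
    return alpha > K * math.log(1 + K)

alpha_c_toy = capacity_per_weight(toy_makes_errors, K=4)
```

Any monotone error criterion can be substituted for the toy predicate without changing the search itself.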

In Fig. 1a we present the storage capacity as a function of the number of hidden units
for both a one-step RSB and an RS treatment at zero stability of the patterns ($\kappa = 0$).
Whereas one-step RSB predicts a logarithmic increase $\alpha_c(K) \propto \ln(K)$ for large
networks, in agreement with the Mitchison-Durbin bound, the results for the RS
theory violate this upper bound¹, i.e., the RS theory fails to predict the qualitative
behaviour correctly.

In Fig. 1a we also show that the storage capacity still increases logarithmically with
the number of units $K$ for non-zero stability, but with a smaller slope $\gamma$. Fig. 1b
shows the dependence of the slope $\gamma$ on the stability $\kappa$ for one-step
RSB. The maximal slope for zero stability, $\gamma = 0.3595 \pm 0.0015$, does not saturate the
Mitchison-Durbin bound $\gamma = 1/\ln 2 \approx 1.4427$, but is about four times lower. With
increasing stability $\kappa$ this slope decreases exponentially, $\gamma \propto \exp(-(6.77 \pm 0.02)\,\kappa)$.
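The numbers quoted in this section can be cross-checked with a few lines of arithmetic. The sketch assumes that the exponential law is normalised by the zero-stability slope, i.e. $\gamma(\kappa) = \gamma(0)\exp(-6.77\,\kappa)$; the text states only the proportionality, so the prefactor is an assumption, and the variable names are illustrative.

```python
import math

gamma_zero = 0.3595          # one-step RSB slope at kappa = 0 (from the text)
decay = 6.77                 # exponential decay constant (from the text)
md_slope = 1.0 / math.log(2) # Mitchison-Durbin slope, 1 / ln 2

def slope(kappa):
    """Assumed form gamma(kappa) = gamma(0) * exp(-decay * kappa)."""
    return gamma_zero * math.exp(-decay * kappa)

ratio = md_slope / gamma_zero   # how far the upstart slope lies below the bound
slope_at_01 = slope(0.1)        # to be compared with the 0.182 quoted for kappa = 0.1
```

Under this assumption the slope at $\kappa = 0.1$ comes out close to the value 0.182 given in the caption of Figure 1, and the ratio to the Mitchison-Durbin slope is close to four, consistent with the statements above.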

4  Summary  and  Discussion 

The objective of this work has been to calculate the storage capacity of multilayer
networks created by the constructive upstart algorithm in a statistical mechanics
framework using the replica method. We found that the RS theory fails to predict
the correct results even qualitatively. The one-step RSB theory yields qualitatively
and quantitatively correct results over a wide range of network sizes and stabilities.
In the one-step RSB treatment, a logarithmic increase with slope $\gamma$ of the capacity of
the upstart algorithm with the number of units $K$ was found for all stabilities. The
slope decreases exponentially [$\gamma \propto \exp(-6.77\,\kappa)$] with the stability $\kappa$. It would be
interesting to investigate if this result carries over to other constructive algorithms
or even to general two-layer networks.

For zero stability the slope of this increase is around four times smaller than the
upper bound ($1/\ln 2$) predicted by information theory. We suggest that this
indicates that the upstart algorithm uses its hidden units less effectively than a general
two-layer network. We think this is due to the fact that the upstart algorithm uses
the hidden units to overrule previous decisions, resulting in an exponential increase
of the hidden layer to output unit weights. This is in contrast to general two-layer
networks, which usually have hidden-output weights of roughly the same order and
can therefore explore a larger space of internal representations. For the upstart
algorithm a large number of internal representations are equivalent and others
cannot be implemented as they are related to erroneous outputs. However, it would
be interesting to investigate how other constructive algorithms (e.g., [6]) perform
in comparison. A systematic investigation of the storage capacity of constructive
algorithms may ultimately lead to a better understanding, and thus possibly to
novel, much improved algorithms.

¹ The violation occurs for $K > 180$ and the largest networks in the RS case were $K = 999$.

The Bayesian analysis of neural networks is difficult because the prior over functions
has a complex form, leading to implementations that either make approximations or use
Monte Carlo integration techniques. In this paper I investigate the use of Gaussian
process priors over functions, which permit the predictive Bayesian analysis to be
carried out exactly using matrix operations. The method has been tested on two
challenging problems and has produced excellent results.

1  Introduction 

In the Bayesian approach to neural networks a prior distribution over the weights
induces a prior distribution over functions. This prior is combined with a noise
model, which specifies the probability of observing the targets $t$ given function
values $y$, to yield a posterior over functions which can then be used for predictions.
For neural networks the prior over functions has a complex form, which means
that implementations must either make approximations [4] or use Monte Carlo
approaches to evaluating integrals [6].

As Neal [7] has argued, there is no reason to believe that, for real-world problems,
neural network models should be limited to nets containing only a "small" number
of hidden units. He has shown that it is sensible to consider a limit where the
number of hidden units in a net tends to infinity, and that good predictions can be
obtained from such models using the Bayesian machinery¹. He has also shown that
a large class of neural network models will converge to a Gaussian process prior
over functions in the limit of an infinite number of hidden units.

Although infinite networks are one method of creating Gaussian processes, it is
also possible (and computationally easier) to specify them directly using parametric
forms for the mean and covariance functions. In this paper I investigate using
Gaussian processes specified parametrically for regression problems², and
demonstrate very good performance on the two test problems I have tried. The advantage
of the Gaussian process formulation is that the integrations, which have to be
approximated for neural nets, can be carried out exactly (using matrix operations) in
this case. I also show that the parameters specifying the Gaussian process can be
estimated from training data, and that this leads naturally to a form of "Automatic
Relevance Determination" [4], [7].

2  Prediction  with  Gaussian  Processes 

A stochastic process is a collection of random variables $\{Y(x) \mid x \in X\}$ indexed by
a set $X$. Often $X$ will be a space such as $\mathbb{R}^d$ for some dimension $d$, although it
could be more general. The stochastic process is specified by giving the probability
distribution for every finite subset of variables $Y(x_1), \ldots, Y(x_k)$ in a consistent
manner. A Gaussian process is a stochastic process which can be fully specified by
its mean function $\mu(x) = E[Y(x)]$ and its covariance function
$C(x, x') = E[(Y(x) - \mu(x))(Y(x') - \mu(x'))]$; any finite set of points will have a joint
multivariate Gaussian distribution.

^1 Large networks cannot be successfully used with maximum likelihood training because of the
overfitting problem.

^2 By regression problems I mean those concerned with the prediction of one or more real-valued
outputs, as compared to classification problems.

Below I consider Gaussian processes which have $\mu(x) = 0$. This is the case for
many neural network priors [7], and otherwise assumes that any known offset or
trend in the data has been removed. A non-zero $\mu(x)$ can be incorporated into the
framework, but leads to extra notational complexity.

Given a prior covariance function $C_P(x, x')$, a noise process $C_N(x, x')$ (with
$C_N(x, x') = 0$ for $x \neq x'$) and data $\mathcal{D} = ((x_1, t_1), (x_2, t_2), \ldots, (x_n, t_n))$,
the prediction for the distribution of $Y$ corresponding to a test point $x$ is obtained simply
by marginalizing the $(n+1)$-dimensional joint distribution to obtain the mean and
variance

$\hat{y}(x) = k_P^T(x) (K_P + K_N)^{-1} t$    (1)

$\sigma_{\hat{y}}^2(x) = C_P(x, x) + C_N(x, x) - k_P^T(x) (K_P + K_N)^{-1} k_P(x)$    (2)

where $[K_\alpha]_{ij} = C_\alpha(x_i, x_j)$ for $\alpha = P, N$, $k_P(x) = (C_P(x, x_1), \ldots, C_P(x, x_n))^T$
and $t = (t_1, \ldots, t_n)^T$. $\sigma_{\hat{y}}^2(x)$ gives the “error bars” of the prediction. In the work
below the noise process is assumed to have a variance $\sigma_\nu^2$ independent of $x$ so that
$K_N = \sigma_\nu^2 I$.
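
Equations (1) and (2) amount to a few matrix operations. The following is a minimal NumPy sketch under stated assumptions, not the implementation used in the paper: the function names `gp_predict` and `sq_exp`, the toy data, and the squared-exponential covariance are all illustrative choices.

```python
import numpy as np

def gp_predict(X, t, x_star, cov_prior, noise_var):
    """Predictive mean and variance at x_star, following equations (1) and (2)."""
    n = X.shape[0]
    # [K_P]_ij = C_P(x_i, x_j); the noise matrix is K_N = noise_var * I.
    K_P = np.array([[cov_prior(X[i], X[j]) for j in range(n)] for i in range(n)])
    K = K_P + noise_var * np.eye(n)
    k_P = np.array([cov_prior(x_star, X[i]) for i in range(n)])
    # Solve linear systems instead of forming the inverse explicitly.
    mean = k_P @ np.linalg.solve(K, t)
    var = cov_prior(x_star, x_star) + noise_var - k_P @ np.linalg.solve(K, k_P)
    return mean, var

def sq_exp(a, b):
    # Illustrative prior covariance (an assumption made for this sketch).
    return np.exp(-0.5 * np.sum((a - b) ** 2))

# Toy data (hypothetical): three one-dimensional inputs and their targets.
X = np.array([[0.0], [1.0], [2.0]])
t = np.array([0.0, 0.8, 0.9])
mean, var = gp_predict(X, t, np.array([1.0]), sq_exp, noise_var=1e-6)
```

With almost no noise, the prediction at a training input reproduces its target closely and the predictive variance shrinks towards zero, which is the behaviour the "error bars" in equation (2) are meant to capture.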

The Gaussian process view provides a unifying framework for many regression methods.
ARMA models used in time series analysis and spline smoothing (e.g. [10])
correspond to Gaussian process prediction with a particular choice of covariance
function^3, as do generalized linear regression models ($y(x) = \sum_i w_i \phi_i(x)$, with $\{\phi_i\}$
a fixed set of basis functions) for a Gaussian prior on the weights $\{w_i\}$. Gaussian
processes have also been used in the geostatistics field (e.g. [3], [1]), and are known
there as “kriging”, but this literature has concentrated on the case where $x \in \mathbb{R}^2$
or $\mathbb{R}^3$, rather than considering more general input spaces. Regularization networks
(e.g. [8], [2]) provide a complementary view of Gaussian process prediction in terms
of a Fourier space view, which shows how high-frequency components are damped
out to obtain a smooth approximator.


2.1  Adapting  Covariance  Functions  and  ARD 

Given a covariance function $C = C_P + C_N$, the log probability $l$ of the training
data is given by

$l = -\frac{1}{2} \log \det K - \frac{1}{2} t^T K^{-1} t - \frac{n}{2} \log 2\pi$    (3)

where $K = K_P + K_N$. If $C$ has some adjustable parameters $\theta$, then we can carry
out a search in $\theta$-space to maximize $l$; this is simply maximum likelihood estimation
of $\theta$^4. For example, in a $d$-dimensional input space we may choose

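$C(x, x') = v_0 \exp\left\{ -\frac{1}{2} \sum_{i=1}^{d} w_i (x_i - x_i')^2 \right\} + v_1 \delta(x, x')$    (4)

where $x_i$ here denotes the $i$th component of $x$, and the $v$'s and the $w$'s are adjustable.
In MacKay’s terms [4] $l$ is the log “evidence”, with the parameter vector $\theta$ roughly
corresponding to his hyperparameters $\alpha$ and $\beta$; in effect the weights have been exactly
integrated out.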
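
As a sketch of how equations (3) and (4) could be evaluated together, the code below builds the covariance matrix and computes the log probability through a Cholesky factorisation. It assumes the exponent of equation (4) is a weighted squared distance with one $w$ per input dimension, and that the training inputs are distinct so that $\delta(x, x')$ reduces to the identity matrix. The function names and toy data are hypothetical.

```python
import numpy as np

def covariance_matrix(X, v0, w, v1):
    """K_ij = v0 * exp(-0.5 * sum_l w_l (x_il - x_jl)^2) + v1 * delta_ij."""
    diff = X[:, None, :] - X[None, :, :]            # shape (n, n, d)
    sq = np.einsum("ijl,l->ij", diff ** 2, w)       # weighted squared distances
    return v0 * np.exp(-0.5 * sq) + v1 * np.eye(X.shape[0])

def log_likelihood(X, t, v0, w, v1):
    """Equation (3): l = -1/2 log det K - 1/2 t^T K^{-1} t - n/2 log(2 pi)."""
    K = covariance_matrix(X, v0, w, v1)
    L = np.linalg.cholesky(K)                       # K = L L^T
    alpha = np.linalg.solve(L, t)                   # t^T K^{-1} t = alpha^T alpha
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    n = len(t)
    return -0.5 * log_det - 0.5 * alpha @ alpha - 0.5 * n * np.log(2.0 * np.pi)

# Toy data (hypothetical) in a two-dimensional input space.
X = np.array([[0.0, 0.0], [1.0, 0.5], [2.0, -1.0]])
t = np.array([0.1, 0.7, -0.3])
l = log_likelihood(X, t, v0=1.0, w=np.array([1.0, 0.5]), v1=0.01)
```

The Cholesky route avoids an explicit inverse and gives the log determinant from the diagonal of the factor, which tends to be better behaved numerically when $v_1$ is small.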

^3 Technically splines require generalized covariance functions.

^4 See section 4 for a discussion of the hierarchical Bayesian approach.

Method             No. of inputs   Sum squared test error
Gaussian process   2               1.126
Gaussian process   6               1.138
MacKay             2               1.146
Neal               2               1.094
Neal               6               1.098

Table 1 Results on the robot arm task.

One reason for constructing a model with variable $w$'s is to express the prior belief
that some input variables might be irrelevant to the prediction task at hand, and
we would expect that the $w$'s corresponding to the irrelevant variables would tend
to zero as the model is fitted to data. This is closely related to the Automatic
Relevance Determination (ARD) idea of MacKay and Neal [5], [7].
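
The effect of a vanishing $w$ can be seen directly in the covariance. The sketch below assumes the weighted squared-distance form of equation (4) without the noise term, and compares two points that differ only in their second input; the helper name `prior_cov` and the numbers are illustrative.

```python
import numpy as np

def prior_cov(x, x_prime, v0, w):
    # Smooth part of the covariance in equation (4), without the v1 noise term.
    return v0 * np.exp(-0.5 * np.sum(w * (x - x_prime) ** 2))

x = np.array([0.3, -1.0])
x_prime = np.array([0.3, 2.0])      # differs from x only in the second input

# Second input treated as relevant: the two function values are nearly uncorrelated.
relevant = prior_cov(x, x_prime, v0=1.0, w=np.array([1.0, 1.0]))
# Second input switched off (w close to zero): the covariance returns to about v0.
irrelevant = prior_cov(x, x_prime, v0=1.0, w=np.array([1.0, 1e-8]))
```

When the weight on an input is close to zero, moving along that input no longer changes the predicted function, which is the sense in which the input is judged irrelevant.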

3  Experiments  with  Gaussian  Process  prediction 

Prediction with Gaussian processes and maximum likelihood training of the covariance
function has been tested on two problems: (i) a modified version of MacKay’s
robot arm problem and (ii) the Boston housing data set.

For both datasets I used a covariance function of the form given in equation 4 and a
gradient-based search algorithm for exploring $\theta$-space; the derivative vector $\partial l / \partial \theta$
was fed to a conjugate gradient routine with a line-search^5.
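
The derivative vector follows from the standard identity $\partial l / \partial \theta_j = -\frac{1}{2} \mathrm{tr}(K^{-1} \partial K / \partial \theta_j) + \frac{1}{2} t^T K^{-1} (\partial K / \partial \theta_j) K^{-1} t$. The sketch below evaluates it in the log parameterization for a covariance of the form of equation (4), assuming a weighted squared-distance exponent and distinct training inputs. The function names are hypothetical, and the result could be handed to any gradient-based optimiser; the conjugate gradient routine itself is not reproduced here.

```python
import numpy as np

def cov_and_grads(X, log_theta):
    """K and dK/d(log theta) for theta = (v0, w_1, ..., w_d, v1)."""
    d = X.shape[1]
    v0 = np.exp(log_theta[0])
    w = np.exp(log_theta[1:d + 1])
    v1 = np.exp(log_theta[d + 1])
    sq = (X[:, None, :] - X[None, :, :]) ** 2            # shape (n, n, d)
    smooth = v0 * np.exp(-0.5 * sq @ w)                  # smooth part of K
    K = smooth + v1 * np.eye(len(X))
    grads = [smooth]                                     # dK / d log v0
    grads += [-0.5 * w[j] * sq[:, :, j] * smooth for j in range(d)]  # dK / d log w_j
    grads.append(v1 * np.eye(len(X)))                    # dK / d log v1
    return K, grads

def log_lik_and_grad(X, t, log_theta):
    """Log probability of equation (3) and its gradient with respect to log theta."""
    K, grads = cov_and_grads(X, log_theta)
    K_inv = np.linalg.inv(K)
    alpha = K_inv @ t
    _, log_det = np.linalg.slogdet(K)
    l = -0.5 * log_det - 0.5 * t @ alpha - 0.5 * len(t) * np.log(2.0 * np.pi)
    # dl/dtheta_j = -1/2 tr(K^-1 dK_j) + 1/2 alpha^T dK_j alpha
    g = np.array([-0.5 * np.trace(K_inv @ dK) + 0.5 * alpha @ dK @ alpha
                  for dK in grads])
    return l, g
```

Working with $\log \theta$ keeps every parameter positive without constraints, at the cost of one extra chain-rule factor per derivative.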

3.1  The  Robot  Arm  Problem 

I consider a version of MacKay’s robot arm problem introduced by Neal (1995).
The standard robot arm problem is concerned with the mappings

$y_1 = r_1 \cos x_1 + r_2 \cos(x_1 + x_2)$,    $y_2 = r_1 \sin x_1 + r_2 \sin(x_1 + x_2)$    (5)

The data was generated by picking $x_1$ uniformly from [-1.932, -0.453] and [0.453,
1.932] and picking $x_2$ uniformly from [0.534, 3.142]. Neal added four further inputs,
two of which were copies of $x_1$ and $x_2$ corrupted by additive Gaussian noise of
standard deviation 0.02, and two further irrelevant Gaussian-noise inputs with zero
mean and unit variance. Independent zero-mean Gaussian noise of variance 0.0025
was then added to the outputs $y_1$ and $y_2$. I used the same datasets as Neal and
MacKay, with 200 examples in the training set and 200 in the test set.
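
A data set of this kind can be simulated as follows. This is a sketch only: the arm lengths `r1 = 2.0` and `r2 = 1.3` are assumed values that this section does not state, the two intervals for $x_1$ are assumed to be equally likely, and the draws will not reproduce the actual datasets of Neal and MacKay.

```python
import numpy as np

def make_robot_arm(n, rng, r1=2.0, r2=1.3):
    # r1 and r2 are assumed values; they are not given in this section.
    sign = rng.choice([-1.0, 1.0], size=n)           # pick one of the two intervals
    x1 = sign * rng.uniform(0.453, 1.932, size=n)
    x2 = rng.uniform(0.534, 3.142, size=n)
    y1 = r1 * np.cos(x1) + r2 * np.cos(x1 + x2)      # equation (5)
    y2 = r1 * np.sin(x1) + r2 * np.sin(x1 + x2)
    # Two noisy copies of the true inputs (standard deviation 0.02).
    noisy_copies = np.column_stack([x1, x2]) + rng.normal(0.0, 0.02, size=(n, 2))
    # Two irrelevant inputs with zero mean and unit variance.
    irrelevant = rng.normal(0.0, 1.0, size=(n, 2))
    inputs = np.column_stack([x1, x2, noisy_copies, irrelevant])
    # Output noise of variance 0.0025, i.e. standard deviation 0.05.
    targets = np.column_stack([y1, y2]) + rng.normal(0.0, 0.05, size=(n, 2))
    return inputs, targets

inputs, targets = make_robot_arm(200, np.random.default_rng(0))
```

The first two columns of `inputs` are the true inputs, columns three and four are the corrupted copies, and the last two are pure noise, matching the six-input variant described above.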

The theory described in section 2 deals only with the prediction of a scalar quantity
$y$, so I constructed predictors for the two outputs separately, although a joint
prediction is possible within the Gaussian process framework (see co-kriging, §3.2.3
in [1]). Two experiments were conducted, the first using only the two “true” inputs,
and the second one using all six inputs. For each experiment ten random starting
positions were tried. The $\log(v)$'s and $\log(w)$'s were all chosen uniformly from [-3.0,
0.0], and were adapted separately for the prediction of $y_1$ and $y_2$. The conjugate
gradient search algorithm was allowed to run for 100 iterations, by which time the
likelihood was changing very slowly. Results are reported for the run which gave
the highest probability of the training data, although in fact all runs performed
very similarly. The results are shown in Table 1^6 and are encouraging, as they
indicate that the Gaussian process approach is giving very similar performance to
two well-respected techniques. All of the methods obtain a level of performance
which is quite close to the theoretical minimum error level of 1.0. It is interesting
to look at the values of the $w$'s obtained after the optimization; for the task
the values were 0.243, 0.237, 0.0650, $1.7 \times 10^{-?}$, $2.6 \times 10^{-?}$, $9.2 \times 10^{-?}$, and $v_0$
and $v_1$ were 7.920 and 0.0022 respectively. The $w$ values show nicely that the first
two inputs are the most important, followed by the corrupted inputs and then the
irrelevant inputs.

^5 In fact the parameterization $\log \theta$ was used in the search to ensure that the $v$'s and $w$'s stayed
positive.

Procedure used                                 Ave. squared test error
Guessing overall mean                          84.4
Best result in Quinlan (1993)                  10.9
Gaussian process                               8.6
Neal (Bayesian network with 2 hidden layers)   6.5

Table 2 Results on the Boston housing data task.

3.2  Boston  Housing  Data 

The  Boston  Housing  data  has  been  used  by  several  authors  as  a  real-world  regression 
problem (the data is available from ftp://lib.stat.cmu.edu/datasets). For each
of  the  506  census  tracts  within  the  Boston  metropolitan  area  (in  1970)  the  data  gives 
13  input  variables,  including  per  capita  crime  rate  and  nitric  oxides  concentration, 
and  one  output,  the  median  housing  price  for  that  tract. 

A ten-fold cross-validation method was used to evaluate the performance, as detailed
in [9]. The dataset was divided into ten blocks of near-equal size and distribution
of class values (I used the same partitions as in [9]). For each block in turn
the parameters of the Gaussian process were trained on the remaining blocks and
then used to make predictions for the hold-out block. For each of the ten experiments
the input variables and targets were linearly transformed to have zero mean
and unit variance, and five random start positions were used, choosing the $\log(v)$'s and
$\log(w)$'s uniformly from [-3.0, 0.0]. In each case the search algorithm was run for
100 iterations. In each experiment the run with the highest evidence was used for
prediction, and the test results were then averaged to give the entry in Table 2.
The fact that the Gaussian process result beats the best result obtained by Quinlan
(who made a reasonably sophisticated application of existing techniques) is very
encouraging. It was observed that different solutions were obtained from the random
starting points, and this suggests that a hierarchical Bayesian approach, as used
in Neal’s neural net implementation and described in section 4, may be useful in
further increasing performance.
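
The evaluation loop can be sketched as below. Several details are assumptions: the blocks are drawn at random rather than taken from the partitions of [9], the standardisation uses training-block statistics only (the text does not say which statistics were used), and `fit_predict` is a hypothetical callable standing in for training the Gaussian process and predicting on the hold-out block.

```python
import numpy as np

def cross_validate(X, t, fit_predict, n_folds=10, seed=0):
    """Average squared test error over n_folds hold-out blocks."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(t)), n_folds)
    sq_errors = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        # Standardise with training-block statistics (an assumption of this sketch).
        mu_x, sd_x = X[train].mean(axis=0), X[train].std(axis=0)
        mu_t, sd_t = t[train].mean(), t[train].std()
        pred = fit_predict((X[train] - mu_x) / sd_x,
                           (t[train] - mu_t) / sd_t,
                           (X[test] - mu_x) / sd_x)
        # Map predictions back to the original target scale before scoring.
        sq_errors.append((pred * sd_t + mu_t - t[test]) ** 2)
    return np.mean(np.concatenate(sq_errors))

def guess_mean(X_train, t_train, X_test):
    # Baseline: in standardised units the training mean is zero.
    return np.zeros(len(X_test))
```

Passing `guess_mean` gives the "guessing overall mean" style of baseline, whose error should be close to the variance of the targets.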

^6 The bottom three lines of the table were obtained from [7]. The MacKay result is the test
error for the net with highest “evidence”.

4 Discussion

I  have  presented  a  Gaussian  process  framework  for  regression  problems  and  have 
shown  that  it  produces  excellent  results  on  the  two  test  problems  tried. 

In section 2 I have described maximum likelihood training of the parameter vector
$\theta$. Obviously a hierarchical Bayesian analysis could be carried out for a model $M$
using a prior $P(\theta | M)$ to obtain a posterior $P(\theta | \mathcal{D}, M)$. The predictive distribution
for a test point and the “model evidence” $P(\mathcal{D} | M)$ are then obtained by averaging
the conditional quantities over the posterior. Although these integrals would have
to be performed numerically, there are typically far fewer parameters in $\theta$ than
weights and hyperparameters in a neural net, so that these integrations should
be easier to carry out. Preliminary experiments in this direction with the Hybrid
Monte Carlo method [7] are promising.
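
Written out for the predictive distribution, the averaging takes the following form. The Monte Carlo approximation on the right is a sketch that assumes samples $\theta^{(1)}, \ldots, \theta^{(S)}$ drawn from the posterior, for instance by a sampler such as Hybrid Monte Carlo; the sample count $S$ is notation introduced here.

```latex
P(y \mid x, \mathcal{D}, M)
  = \int P(y \mid x, \theta, \mathcal{D}, M) \, P(\theta \mid \mathcal{D}, M) \, d\theta
  \approx \frac{1}{S} \sum_{s=1}^{S} P\left(y \mid x, \theta^{(s)}, \mathcal{D}, M\right)
```

Each term in the sum is a Gaussian with the mean and variance of equations (1) and (2) evaluated at $\theta^{(s)}$, so the approximate predictive distribution is a mixture of Gaussians.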

I have also conducted some experiments on the approximation of neural nets (with a
finite number of hidden units) by Gaussian processes, although space limitations do
not allow me to describe these here. Other directions currently under investigation
include (i) the use of Gaussian processes for classification problems by softmaxing
the outputs of $k$ regression surfaces (for a $k$-class classification problem), and (ii)
using non-stationary covariance functions, so that $C(x, x') \neq C(|x - x'|)$.

REFERENCES 

[1] N. A. C. Cressie, Statistics for Spatial Data, Wiley (1993).

[2] F. Girosi, M. Jones, and T. Poggio, Regularization Theory and Neural Networks Architectures,
Neural Computation, Vol. 7(2) (1995), pp219-269.

[3] A. G. Journel and Ch. J. Huijbregts, Mining Geostatistics, Academic Press (1978).

[4] D. J. C. MacKay, A Practical Bayesian Framework for Backpropagation Networks, Neural
Computation, Vol. 4(3) (1992), pp448-472.

[5] D. J. C. MacKay, Bayesian Methods for Backpropagation Networks, in: J. L. van Hemmen,
E. Domany, and K. Schulten, eds., Models of Neural Networks II, Springer (1993).

[6] R. M. Neal, Bayesian Learning via Stochastic Dynamics, in: S. J. Hanson, J. D. Cowan, and
C. L. Giles, eds., Neural Information Processing Systems, Vol. 5 (1993), pp475-482. Morgan
Kaufmann, San Mateo, CA.

[7] R. M. Neal, Bayesian Learning for Neural Networks, PhD thesis, Dept. of Computer Science,
University of Toronto (1995).

[8] T. Poggio and F. Girosi, Networks for approximation and learning, Proceedings of IEEE,
Vol. 78 (1990), pp1481-1497.

[9] J. R. Quinlan, Combining Instance-Based and Model-Based Learning, in: P. E. Utgoff, ed.,
Proc. ML’93, Morgan Kaufmann, San Mateo, CA (1993).

[10] G. Wahba, Spline Models for Observational Data, Society for Industrial and Applied Mathematics,
CBMS-NSF Regional Conference series in applied mathematics (1990).

Acknowledgements 

I thank Radford Neal and David MacKay for many useful discussions and for generously
providing data used in this paper, Chris Bishop, Peter Dayan, Radford Neal
and Huaiyu Zhu for comments on earlier drafts, George Lindfield for making his
implementation of the scaled conjugate gradient search routine available and Carl
Rasmussen for his C implementation which runs considerably faster than my MATLAB
version. This work was supported by EPSRC grant GR/J75425.


STOCHASTIC FORWARD-PERTURBATION, ERROR
SURFACE AND PROGRESSIVE LEARNING IN
NEURAL NETWORKS

Li-Qun  Xu 

Intelligent  Systems  Research  Group,  BT  Research  Laboratories 
Martlesham Heath, Ipswich, IP5 7RE, UK.

We address the issue of progressive learning of neural networks, focusing on the situation in
which the environment or the training set for a learner is of fixed size.^1 We concentrate on the
phenomenon of the change in shape of the error surface of a neural network (defined over the
weight space) as a result of presenting it with a range of intermediate tasks during the course
of learning, and a goal-driven mechanism is therefore proposed and analysed. The stochastic
gradient smoothing algorithm (SGSA) [14, 16] is found to be effective in implementing this idea,
and experiments demonstrate the usefulness of our approach to progressive learning.

1  Active  Learning:  Non-fixed  vs  Fixed  Data  Set 

Findings in cognitive science have shown that a learning process is normally a
bidirectional process involving interactive actions (information exchange) between
a learner and its surrounding environment [3]. In active learning of neural
networks, the learner is a neural network of specified configuration A parameterised
by some connection weight vector $W \in \mathbb{R}^N$, learning to perform a particular task
such as function mapping [8], pattern classification [11] and robot control [13] among
many others. The environment is characterised by a training data set of fixed or
non-fixed size, or some exploratory space of certain underlying structures.

Current research on active learning of neural networks has focused on the
design of various well-defined information and/or generalisation criteria [6, 8, 13,
9, 4]. Based on these criteria, a new data example is selected from among a set of
available data examples such that, when it is added to the previous training set
and learned, the trained network attains the maximum accuracy of fit to the
data and improved generalisation performance [10]. Notably, this strategy imposes
no restrictions on the size of the data set.

However, other approaches to active learning, given that the training data set is of
fixed size, follow a slightly different trend which essentially emphasises the principle
of progressive learning [11]. This is usually achieved by letting the neural network
learn a succession of varied subsets that represent, on each occasion, a different level
of abstraction of the final complex task. Subsequently, the response of the learner
to the changeable environment - the particular subsets engaged - is monitored, and
the performance thus far achieved, in terms of model misspecification and network
variance, will determine what aspects of the environment the learner is to face next
time.

There are various ways in which the whole environment can be decomposed: one
can divide the entire data set according to sample size from small to
large [1], the degree of difficulty from low to high [5], [3], etc. This way of dividing
the entire data set is by and large based on the results of empirical studies and
the understanding of individual experimenters of the tasks at hand. In this paper
we propose an alternative way of performing progressive learning that is free from
the difficulties of artificially decomposing the environment. In the following section
we introduce the view of progressive learning in terms of dealing with the change in
shape of error surfaces. In section 3 we propose to use the stochastic gradient smoothing
algorithm (SGSA) to implement this idea. In section 4 experiments are conducted
to demonstrate the usefulness of the approach for active learning of neural networks.
The paper is concluded in section 5.

^1 which is often the case in practical applications of neural networks, especially pattern
recognition.

2  Progressive  Learning  -  a  Goal-driven  Perspective 

It is instructive to interpret the progressive way of learning an entire
environment by a neural network in terms of the change in shape of its error surface
over the connection weights. Given a neural network A specified by a connection
weight vector $W \in \mathbb{R}^N$ and a training set $\Theta^{(S)}$ of $|S|$ pairs of labelled examples,
$\{x_j, \ldots\}$, describing the whole environment, an error function can then
be expressed as

$f_{\Theta^{(S)}}(W)$ :    (1)

the notation in the equation is meant to show that the shape of the
error surface in the weight space $W \in \mathbb{R}^N$ is entirely determined by the training
set. (Figure 1 (a) shows a two-dimensional profile of such an error surface when a
final convergence has been reached.) Consequently, the strategy of using a varied
data subset in session $k$, with $S_k \subset S$, for the training of A will invariably lead
to a change in the shape of its error surface, or $f_{\Theta^{(S_k)}}(W)$, $S_k \subset S$. This is what we
mean by the data-driven mechanism detailed in Figure 2 (a). With the subsets $S_k \subset S$
properly constructed, it is expected that the shape of the error surface will
change from an initially coarse and more or less smooth terrain to the final one bearing
all the details but with somewhat less regularity.

Figure 1 (a) A two-dimensional profile of an error surface with multiple valleys
and ridges. (b) A smoothed version of case (a), retaining most of its main features
while free of details such as steepness, roughness and ravines. In between, there
exists a range of intermediate error surfaces that may arise from either of the two
mechanisms discussed below.

We are interested, however, in exploring the idea from the opposite perspective.
In fact, we can manipulate the shape of the error surface by subjecting it to some
mathematical smoothing operations to generate, at our own disposal, a sequence of
intermediate error surfaces that achieves the same effect as the previous strategy.

Figure 2 Two alternative mechanisms giving rise to a sequence of error surfaces
$\{\mathcal{S}_k\}$ (intermediate tasks), given a neural network, A, specified by its weight vector
$W \in \mathbb{R}^N$: see Note (i) below.

Figure 3 A model-free neural network learning paradigm for stochastic forward-perturbation
algorithms. See Note (ii) below.

Note (i): The data-driven mechanism (a), where $\{\mathcal{S}_k = f_{\Theta^{(S_k)}}(W), S_k \subset S\}$. The learner is
faced, each time, with a varied training set describing different details of the environment,
and consequently has to learn a different error surface defined by the training (sub)set
engaged. By learning the preceding error surfaces, the weight vector $W \in \mathbb{R}^N$ tends to be
positioned in an advantageous location to facilitate the process of learning the following
error surfaces. The goal-driven mechanism (b), where $\{\mathcal{S}_k = f_{\Theta^{(S)}}(W, \beta_k), k = 1, 2, \cdots\}$.
The learner in (b) is confronted with the entire training set or the whole environment,
though the actual task for the learner at a time is a simplified version of the error surface
obtainable by convolving it with an appropriate $N$-dimensional kernel function (p.d.f.)
of the same family but a varied “width” $\beta_k$. Consequently, this strategy also results in a
sequence of error surfaces of different shapes. Note that in (b), the convolution operation
is implicitly carried out in the course of learning by the SGSA.

Note (ii): The environment is given by the training data set $\Theta^{(S)} = \{x_j, \ldots\}$. All the
connection weights in A are arranged as a weight vector $W \in \mathbb{R}^N$. In this paradigm
an extra dimension of randomness characterised by the perturbation vectors $P$ can be
explored.


Figure 2 (b) illustrates this idea, which we call the goal-driven mechanism, where the
sequence of error surfaces at the output end can be equally viewed as those ranging
between Figure 1 (b), a smooth shallow surface with established prominent characteristics,
and Figure 1 (a), a surface with all the details (and idiosyncrasies). It is
from this viewpoint that we argue that the stochastic forward-perturbation algorithms
[15] - the category of learning algorithm that has its roots in the theory
of stochastic approximation of nonlinear dynamical processes - can be employed to
accomplish the task of progressive learning without caring about how to divide the
training set.

3  Mathematical  Analysis 

As opposed to the back-propagation (BP) algorithm, the operations of updating the
weight vector $W$ by stochastic forward-perturbation algorithms generally follow an
F - P - F - G pattern (Figure 3), i.e. one forward propagation of the training set
through A to measure the error function as it currently stands; one perturbation
of the weight vector with some controllable noise $P \in \mathbb{R}^N$; another forward
propagation to measure the response of A to the perturbation just introduced; and,
based on these measurements, a gradient estimation obtained by employing
some intuitive difference approximation or more subtle correlation methods. The
estimated gradient can then be incorporated into stochastic approximation algorithms
to update the weight vector.
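
One cycle of this pattern can be sketched as follows. The estimator shown, which correlates the change in error with a Gaussian perturbation, is one simple choice made for illustration and is not claimed to be the estimator of any particular algorithm cited here; `error_fn`, `strength` and the function name are hypothetical.

```python
import numpy as np

def fpfg_gradient(error_fn, W, strength, rng):
    """One F-P-F-G cycle: forward pass, perturb, forward pass, gradient estimate."""
    e0 = error_fn(W)                                 # F: error as the network stands
    P = strength * rng.standard_normal(W.shape)      # P: controllable Gaussian noise
    e1 = error_fn(W + P)                             # F: response to the perturbation
    # G: correlate the change in error with the perturbation. For Gaussian P,
    # E[(e1 - e0) * P] is approximately strength^2 times the gradient.
    return (e1 - e0) * P / strength ** 2
```

A single estimate is noisy, so in practice many cycles would be averaged or fed into a stochastic approximation update. No backward pass through the network is needed, only evaluations of the error.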

In the following, the advanced algorithm called the stochastic gradient smoothing algorithm
[16] is employed. According to Figure 2 (b), the following smoothing operation
has been invoked (note that the data-dependent notation has been dropped):

$\hat{f}(W, \beta) = f(W) \otimes [G(-W, \beta) + G(W, \beta)]$
$\qquad = \int G(\Lambda, \beta) [f(W + \Lambda) + f(W - \Lambda)] \, d\Lambda$    (2)

where the symbol ‘$\otimes$’ denotes the convolution operator. Following some mathematical
manipulations, one stochastic gradient estimator is shown as follows:

$\xi_k \triangleq \nabla_W \hat{f}(W_k, \beta_k) = \ldots \sum_{i=1} \ldots (\delta f)_i(W_k)$    (3)

where $\Lambda_i$ is a random (perturbation) vector generated from an independent Gaussian
p.d.f. $G(\Lambda)$. The scalar $\beta_k$, referred to as the smoothing factor, determines the
width of the underlying p.d.f. or the perturbation strength of the generated random
vectors. Finally, the difference in the cost function due to the $i$th perturbation of
$W_k$ is $(\delta f)_i(W_k)$. Based on the estimated gradient $\xi_k$, the weight vector $W_k$ is
updated according to

$d_k = \rho_k \cdot \xi_k + (1 - \rho_k) \cdot d_{k-1}$    (4)

$W_{k+1} = W_k - \eta_k \cdot d_k$    (5)

where the direction vector $d_k = (d_{k1}, d_{k2}, \cdots, d_{kN})' \in \mathbb{R}^N$ represents a running
average over the current gradient estimate $\xi_k$ and the previous direction vector
$d_{k-1}$. The other two important parameters appearing in the above formulas are
$\eta_k$, the learning rate (or adaptive step size), and $\rho_k$, the accumulating factor. It
is desirable that as the iteration $k \to \infty$, $\eta_k \to 0$ and $\rho_k \to 1$. The parameters $\eta_k$
and $\rho_k$ are locally adaptable.
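
The update rules (4) and (5) can be put together in a short loop. The sketch below rests on several assumptions that are not taken from the text: the gradient estimate is a symmetric-difference average over `M` Gaussian perturbations scaled by the smoothing factor, and the learning rate `eta` and accumulating factor `rho` are held fixed rather than adapted locally. It should be read as an illustration of the update structure, not as the SGSA itself.

```python
import numpy as np

def sgsa_sketch(error_fn, W0, betas, iters_per_beta, M=10, eta=0.05, rho=0.5, seed=0):
    rng = np.random.default_rng(seed)
    W = np.array(W0, dtype=float)
    d = np.zeros_like(W)
    for beta in betas:                        # one intermediate error surface per beta
        for _ in range(iters_per_beta):
            Lam = rng.standard_normal((M, W.size))
            # Error differences under +/- perturbations of strength beta.
            diffs = np.array([error_fn(W + beta * lam) - error_fn(W - beta * lam)
                              for lam in Lam])
            # Assumed estimator of the gradient of the smoothed error surface.
            xi = (diffs[:, None] * Lam).sum(axis=0) / (2.0 * beta * M)
            d = rho * xi + (1.0 - rho) * d    # equation (4): running average
            W = W - eta * d                   # equation (5): weight update
    return W

# Usage on a hypothetical quadratic error with its minimum at (1, 1).
W_final = sgsa_sketch(lambda w: np.sum((w - 1.0) ** 2), [3.0, -2.0],
                      betas=(2.6, 1.0, 0.5, 0.25, 0.1), iters_per_beta=200)
```

Running the outer loop over a decreasing sequence of smoothing factors means the learner meets a coarse version of the error surface first and the detailed one last.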

4  Experimental  Results 

A simulated two-class pattern classification problem, the details of which can be
found in [7], is used to demonstrate our approach to active learning of neural
networks. The distributions of these two classes ($C_1$ and $C_0$) overlap, and a
complete separation of their examples is impossible even for an optimal classifier.
In the experiments, the training set consists of 200 examples with 100 drawn from
each class, while the test set contains 1000 examples with 500 belonging to each
class. The neural network used for this task has a fully-connected three-layer 2-4-1
structure, amounting to 17 weights including biases. The single output of the
trained network should ideally be “1” for examples in $C_1$ and “0” for examples in $C_0$
when it is shown the unseen data in the testing phase.
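
The figure of 17 weights can be checked by counting one weight per connection plus one bias per non-input unit. The helper name below is hypothetical.

```python
def count_weights(layer_sizes):
    """Weights in a fully-connected feed-forward net, one bias per non-input unit."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

n_weights = count_weights([2, 4, 1])   # (2 + 1) * 4 + (4 + 1) * 1
```

The hidden layer contributes 12 parameters and the output unit 5, so the weight vector $W$ lives in a 17-dimensional space.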


Figure 4 (a) A typical learning trajectory achieved by the SGSA. The vertical
dashed lines mark the boundaries where a new intermediate error surface, characterised
by a different value of $\beta_k$ shown along the lines, is encountered by the
learner. In this case, there are 4 intermediate error surfaces (where $\beta_k$ assumes a
value of 2.6, 1.0, 0.5 and 0.25 in order) to be learned before the learner faces the
final task (where $\beta_k = 0.1$), which approximates the error surface imposed by
the complete environment for which $\beta_k = 0$. (b) A set of 10 learning trajectories
for the test problem by the SGSA with different initial weight vectors.


For the problem, 10 trials are performed, each starting with a different weight vector
$W_0$ having random values ranging between ±0.5. Figure 4 (a) shows a typical
learning trajectory, the mean squared error vs number of iterations, achieved by
the SGSA.

Table 1 summarises the average performance among ten trials for the SGSA, where
$E_{tr}$ denotes the mean squared error (MSE) for the training set; $n_{tr}$ gives, up to
the indicated iterations, the number of examples yet to be correctly learned in
the training set (200 examples in total); $n_{te}$ shows the generalisation performance
measured by the number of examples that are still misclassified, up to the given
training iterations, in the test set (1000 examples in total). An important result
is that the generalisation performance given by $n_{te}$ tends to saturate rather
than deteriorate with more iterations, which is not the case in the experiments using
the deterministic Quick-prop algorithm (an advanced version of the BP algorithm) [15].

Algo.   Iter.    700     800     900     1000    1100    1200    1300    1400    1500
SGSA    E_tr     2.658   2.551   2.477   2.410   2.367   2.313   2.290   2.261   2.229
        n_tr     11.4    9.9     9.7     9.6     8.9     8.6     8.3     8.4     8.3
        n_te     66.9    66.0    64.6    62.2    62.1    62.5    63.2    61.9    61.9

Table 1 The average performance among 10 trials of the SGSA at selected iterations
when applied to the simulated two-class pattern classification problem.


Thus  the  SGSA  compares  favourably  with  the  Quick-prop  in  this  regard.  Figure  4 
(b)  describes  the  corresponding  set  of  10  learning  trajectories  obtained. 

5  Summary 

An alternative approach to active learning of neural networks has been explored.
It interprets the error surface imposed by the whole training set (a certain task)
in terms of a range of its intermediate versions, or a set of intermediate tasks.
We analysed two different perspectives, called, respectively, the data-driven
mechanism and the goal-driven mechanism, that give rise to such a range of error
surfaces. We argued that the latter approach could well be an effective way to
control the amount of information about a complex environment that is accessible
to a learner at any one time, therefore fulfilling the same objective (of
progressive learning) as the former without explicitly decomposing the environment
in a hard fashion. The stochastic gradient smoothing algorithm (SGSA) was employed
to implement this idea. Experiments have been conducted to support our claims.
Further theoretical studies of this issue are needed.

The convergence and the convergence rate of one self-organization algorithm for the
high-dimensional self-organizing map in the linearized Amari model are studied in the context
of stochastic approximation. The conditions on the neighborhood function and the learning rate

1 Introduction

Self-organizing neural networks arose from the modeling of self-organized
phenomena in nervous systems. The typical models are those given by Willshaw and
von der Malsburg (1976) [14], Grossberg (1976) [8], Amari (1980) [1], and Kohonen
(1982) [9]. These self-organization systems can alter their internal connections to
represent the statistical features and topological relations in an input space.

The self-organizing maps (SOMs) are expressed by the weights of the self-organizing
networks. They are very effective in modeling the topographic mapping among neural
fields. Although the SOMs are widely used in neural modeling and applications,
the theory of the SOMs is far from complete. The question of whether the
self-organizing maps converge to the topologically correct mappings is still open,
especially in the high dimensional case.

There are many models for the SOMs. Due to the space limit we only consider
Amari's nerve field model [1, 2]. Another important SOM model is the feature
map [9]. The stability of the feature map is discussed in [3, 6, 7, 10, 12].

The stability of the SOM in the nerve field model is analyzed in [1, 5, 13, 15].
The existing convergence results are mostly for the one-dimensional model. It is very
difficult to analyze the stability of the SOM in any high dimension. The results in
[5, 15] are only valid for some special cases of the linearized Amari model. We shall
formulate the linearized Amari model in any high dimension and discuss not only
the convergence but also the convergence rate of the self-organization algorithm for
updating the weights. The same approach can also be used to analyze the stability
of the feature map.

2  High-dimensional  Self-organizing  Maps 

Let the $K$-dimensional grid $\mathcal{L}_K = \{0, 1, \ldots, N\}^K$ of $(N+1)^K$ neurons be the
presynaptic field in the Amari model. A neuron in $\mathcal{L}_K$ is denoted by

$$J = (j_1, \ldots, j_K) \in \mathcal{L}_K.$$

Let $\mathcal{W}_K$ be the space of all mappings $W = W(J) : \mathcal{L}_K \to \mathcal{L}_n \subset \mathbb{R}^n$, where each
$W(J)$ is a column vector in $\mathbb{R}^n$. The weight vectors in $\mathcal{W}_K$ are labeled by the
neurons in $\mathcal{L}_K$.

2.1  High  Dimensional  Topographic  Map 

Let us consider a system consisting of a presynaptic field $\mathcal{L}_K$ and a postsynaptic field
$\mathcal{L}_n$. The neighborhood relation between each pair of neurons in $\mathcal{L}_K$ is determined by
their relative positions in $\mathcal{L}_K$. A high dimensional topographic map is an ordered
mapping in $\mathcal{W}_K$ under which the neighborhood relation in $\mathcal{L}_K$ is preserved in $\mathcal{L}_n$.
To achieve a topographic map, we choose a random mapping in $\mathcal{W}_K$, then use the
following algorithm to update the mapping recursively:


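$$W_{t+1}(J) = \begin{cases} (1 - \lambda_t) W_t(J) + \dfrac{\lambda_t}{c_t} \sum_{J' \in N_c(t)} W_t(J'), & J \in N_c(t) \cap \mathcal{L}_K^{\circ} \\ W_t(J), & J \in \mathcal{L}_K^{\circ} - N_c(t) \\ B(J) \ \text{(boundary condition)}, & J \in \delta\mathcal{L}_K \end{cases} \qquad (1)$$

where
$\lambda_t$ is a learning rate, $\mathcal{L}_K^{\circ} = \{1, \ldots, N-1\}^K$ the inner lattice of $\mathcal{L}_K$,
$\delta\mathcal{L}_K = \mathcal{L}_K - \mathcal{L}_K^{\circ}$, $N_c(t) = \sigma_t(I_t)$ a neighborhood set,
$I_t = (i_1(t), \ldots, i_K(t))$ a random stimulus process on $\mathcal{L}_K$,
$\sigma_t(I)$ a neighborhood set around $I$ such that $\sigma_t(I) \subset \mathcal{L}_K$ for each
$I \in \mathcal{L}_K$ and $t$, and $c_t$ the number of points in the set $N_c(t)$.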

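The equation (1) is a linearized version of the learning equation in [1]. It does
not update the mapping on the boundary of $\mathcal{L}_K$. But the boundary condition will
affect the map on $\mathcal{L}_K^{\circ}$ (the inner lattice) whenever the neighborhood set touches
the boundary of the grid $\mathcal{L}_K$, and eventually shape the topographic map.
Let $h_t(I, J)$ be the characteristic function of the set $\sigma_t(I) \cap \mathcal{L}_K$.

The equation (1) can be rewritten as the following:

$$W_{t+1}(J) = W_t(J) - \lambda_t h_t(I_t, J) W_t(J) + \frac{\lambda_t}{c_t} h_t(I_t, J) \sum_{J' \in \mathcal{L}_K^{\circ}} h_t(I_t, J') W_t(J') + \frac{\lambda_t}{c_t} h_t(I_t, J) \sum_{J' \in \delta\mathcal{L}_K} h_t(I_t, J') B(J') \qquad (2)$$

where $c_t = \sum_{J \in \mathcal{L}_K} h_t(I_t, J)$. Note from the definition of $c_t$, it is easy to show that
$W_t(J)$ is bounded.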
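
As a hedged illustration of the update rule (1), the following Python sketch runs the recursion on a one-dimensional grid with scalar weights. It assumes that the rule replaces each inner weight inside the neighborhood set by a convex combination of its old value and the neighborhood average of the weights. The grid size, the neighborhood radius, the constant learning rate and the names `amari_step` and `simulate` are illustrative choices, not taken from the paper.

```python
import random


def amari_step(w, stimulus, radius, rate):
    """Apply one assumed update of rule (1) on the grid {0, ..., N}.

    w holds one scalar weight per neuron; w[0] and w[N] are the fixed
    boundary values B(J) and are never changed.
    """
    n_max = len(w) - 1
    # Neighborhood set N_c(t): grid points within `radius` of the stimulus.
    hood = [j for j in range(n_max + 1) if abs(j - stimulus) <= radius]
    c_t = len(hood)  # number of points in N_c(t)
    hood_mean = sum(w[j] for j in hood) / c_t
    new_w = list(w)
    for j in hood:
        if 0 < j < n_max:  # only the inner lattice is updated
            new_w[j] = (1.0 - rate) * w[j] + rate * hood_mean
    return new_w


def simulate(steps=2000, n_max=10, radius=1, rate=0.1, seed=0):
    """Run the recursion from random inner weights with boundary values 0 and 1."""
    rng = random.Random(seed)
    w = [rng.random() for _ in range(n_max + 1)]
    w[0], w[n_max] = 0.0, 1.0
    for _ in range(steps):
        # Stimulus drawn uniformly from the inner lattice (an assumption).
        stimulus = rng.randint(1, n_max - 1)
        w = amari_step(w, stimulus, radius, rate)
    return w
```

Because every update in this sketch is a convex combination of existing weights and boundary values, the weights stay within the range spanned by the initial and boundary values, which is the boundedness property noted in the text.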

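In the rest of this paper, we assume a general neighborhood function $h_t(I, J) \ge 0$
on $\mathcal{L}_K \times \mathcal{L}_K$ in (2). Let the neurons in $\mathcal{L}_K^{\circ}$ be arranged in the dictionary order
by their indexes in $\mathcal{L}_K^{\circ} = \{J_1, J_2, \ldots, J_m\}$, where $J_1 = (1, \ldots, 1)$, $J_2 = (1, \ldots, 1, 2)$,
$\ldots$, $J_m = (N-1, \ldots, N-1)$ and $m = (N-1)^K$. Each mapping $W_t$
is an $n \times m$ matrix. Using $w_t^i$ to denote the transpose of the $i$-th row of $W_t$, we find
the following vector form for (2):

$$w_{t+1}^i = w_t^i - \lambda_t U_t w_t^i + \frac{\lambda_t}{c_t} u_t u_t^T w_t^i + \lambda_t b_t^i \qquad (3)$$

where $(\cdot)^T$ denotes the transpose, $u_t = (h_t(I_t, J_1), \ldots, h_t(I_t, J_m))^T$, $U_t = \mathrm{diag}(u_t)$,
and $b_t^i = \frac{1}{c_t} u_t \sum_{J' \in \delta\mathcal{L}_K} h_t(I_t, J') B_i(J')$, with $B_i(J')$ the $i$-th component of $B(J')$.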

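To analyze the convergence of the algorithm (2), we rewrite the system (3) as:

$$w_{t+1}^i = w_t^i - \lambda_t (Q_t w_t^i - b_t^i) = w_t^i - \lambda_t (\bar{Q}_t w_t^i - \bar{b}_t^i + \zeta_t^i) \qquad (4)$$

where $Q_t = U_t - \frac{1}{c_t} u_t u_t^T$, $\bar{Q}_t = E[Q_t]$, $\bar{b}_t^i = E[b_t^i]$, $\zeta_t^i = (Q_t - \bar{Q}_t) w_t^i - (b_t^i - \bar{b}_t^i)$, and
$\zeta_t^i$ is a martingale-difference process, i.e., $E[\zeta_t^i \mid \mathcal{F}_{t-1}] = 0$ for $\mathcal{F}_t = \sigma\{I_0, \ldots, I_t\}$.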

We  need  the  following  assumptions: 

A1  $h_t(I, J) \to h(I, J)$ and $c_t \to c$ (a positive constant) as $t \to \infty$.

A2  $I_t$ is an independent random sequence with the probability distribution $P(J_k)$, $k = 1, \ldots, m$.


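Denote

$$Q(I) = \mathrm{diag}(u(I)) - \frac{1}{c} u(I) u(I)^T, \qquad Q = \sum_{k=1}^{m} Q(J_k) P(J_k),$$

$$b^i = \frac{1}{c} \sum_{J' \in \delta\mathcal{L}_K} \sum_{k=1}^{m} u(J_k) h(J_k, J') B_i(J') P(J_k),$$

where $u(I) = (h(I, J_1), \ldots, h(I, J_m))^T$.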


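From (4), we have

$$w_{t+1}^i = w_t^i - \lambda_t (Q w_t^i - b^i + \zeta_t^i + \eta_t^i) \qquad (5)$$

where $\eta_t^i = (\bar{Q}_t - Q) w_t^i - (\bar{b}_t^i - b^i) \to 0$ as $t \to \infty$ because of A1.

Let $m^i = Q^{-1} b^i$ be the stationary state of the system (5). The next theorem shows
that $W_t$ almost surely converges to $M = (m^1, \ldots, m^n)$.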


Theorem 1  Under the assumptions A1 and A2, if $Q > 0$ (positive definite) and
the learning rate $\lambda_t > 0$ satisfies the following conditions:

$$\sum_t \lambda_t = \infty, \qquad \sum_t \lambda_t^2 < \infty, \qquad (6)$$

then for $i = 1, \ldots, n$, $w_t^i \to m^i$ a.s. (almost sure convergence).

Proof  Let $x_t$ be generated by the following equation: $x_{t+1} = x_t - \lambda_t (Q(x_t - m^i) + \zeta_t^i)$.
Let $h(x) = Q(x - m^i)$, then $(x - m^i)^T Q^T h(x) = (x - m^i)^T Q^T Q (x - m^i) > 0$,
$\forall x \neq m^i$. Therefore, applying Gladyshev's Theorem (see Theorem 2.2 in [4]), we
have $x_t \to m^i$, a.s.

Let $y_t = w_t^i - x_t$, then $y_t$ satisfies $y_{t+1} = y_t - \lambda_t (Q y_t + \eta_t^i)$. Since $Q > 0$ and $\eta_t^i$
tends to zero as $t$ tends to infinity, $y_t$ also tends to zero. Therefore, $w_t^i \to m^i$, a.s.
□
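
The auxiliary recursion used in the proof can be tried numerically. The sketch below is only an illustration under stated assumptions: it uses a hypothetical 2 x 2 symmetric positive definite matrix `Q`, a hypothetical stationary state `m`, independent uniform noise in place of the martingale-difference term, and the step sizes 1/t, whose sum diverges while the sum of their squares converges.

```python
import random


def stochastic_approximation(Q, m, steps=20000, seed=0):
    """Iterate x_{t+1} = x_t - lam_t * (Q (x_t - m) + noise_t) with lam_t = 1/t."""
    rng = random.Random(seed)
    dim = len(m)
    x = [0.0] * dim
    for t in range(1, steps + 1):
        lam = 1.0 / t  # sum of lam_t diverges, sum of lam_t^2 converges
        drift = [sum(Q[i][j] * (x[j] - m[j]) for j in range(dim))
                 for i in range(dim)]
        # Zero-mean disturbance standing in for the martingale-difference term.
        noise = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        x = [x[i] - lam * (drift[i] + noise[i]) for i in range(dim)]
    return x


Q = [[2.0, 0.5], [0.5, 1.0]]  # symmetric positive definite (illustrative)
m = [1.0, -2.0]               # stationary state (illustrative)
x_final = stochastic_approximation(Q, m)
error = max(abs(a - b) for a, b in zip(x_final, m))
```

With these settings the final iterate ends up close to `m`, in line with the almost sure convergence stated in Theorem 1; a single simulated path is of course not a proof.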

Note an example is given in [15] where $Q > 0$. When the first condition in (6) is
not satisfied, the map may still converge but not to the desired stationary state.
We shall use an approach in [11] to discuss the convergence of $\bar{W}_t = \frac{1}{t} \sum_{s=0}^{t-1} W_s$,
assuming that the learning rate satisfies one of the following conditions:

$$\lambda_t = \lambda, \qquad 0 < \lambda < 2 \left( \min_i \mu_i(Q) \right)^{-1} \qquad (7)$$

$$\lambda_t \downarrow 0, \qquad \frac{\lambda_t - \lambda_{t+1}}{\lambda_t} = o(\lambda_t) \qquad (8)$$

where $\{\mu_i(Q)\}$ are eigenvalues of $Q$. Note the learning rate $\lambda_t \propto t^{-\alpha}$ with $0 < \alpha < 1$
satisfies the condition (8).

The next theorem gives the convergence rate of the averaged weight $\bar{W}_t$. It also
shows that $\bar{W}_t$ converges to $M$ in mean square.


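Theorem 2  Let $Q > 0$, the learning rate satisfy (7) or (8), and the following
conditions hold:

A3  for some $\gamma > 1/2$, $\bar{Q}_t - Q = O(t^{-\gamma})$ and $\bar{b}_t^i - b^i = O(t^{-\gamma})$;

A4  for each $J$, $\sup_t \sum_{k=1}^{m} |h_t(J_k, J) - \bar{h}_t(J)|^{\beta} P(J_k) < \infty$, where $\beta > 2$ and
$\bar{h}_t(J) = \sum_{k=1}^{m} h_t(J_k, J) P(J_k)$;

A5  for each $J$ and $J'$, $\sup_t \sum_{k=1}^{m} |h_t(J_k, J) h_t(J_k, J') - \bar{h}_t(J, J')|^{\beta} P(J_k) < \infty$,
where $\bar{h}_t(J, J') = \sum_{k=1}^{m} h_t(J_k, J) h_t(J_k, J') P(J_k) < \infty$;

A6  $\lim_{t \to \infty} E[\zeta_t^i (\zeta_t^i)^T] = S^i > 0$.

Denote $V = Q^{-1} S^i Q^{-T}$. Then for $\bar{w}_t^i = \frac{1}{t} \sum_{s=0}^{t-1} w_s^i$, we have

$$\sqrt{t} (\bar{w}_t^i - m^i) \xrightarrow{d} N(0, V),$$

$$\lim_{t \to \infty} E[t (\bar{w}_t^i - m^i)(\bar{w}_t^i - m^i)^T] = V. \qquad (9)$$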

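Proof  From A4 and A5, we have $\sup_t E[\|Q_t - \bar{Q}_t\|^{\beta}] < \infty$ and
$\sup_t E[\|b_t^i - \bar{b}_t^i\|^{\beta}] < \infty$. Therefore,

$$\sup_t E[|\zeta_t^i|^{\beta}] < \infty. \qquad (10)$$

Because $w_t^i$ is bounded, we have $\sup_t E[|\zeta_t^i|^2 \mid \mathcal{F}_{t-1}] < \infty$. From (10), we have

$$\lim_{C \to \infty} \overline{\lim_{t \to \infty}} E[|\zeta_t^i|^2 I(|\zeta_t^i| > C) \mid \mathcal{F}_{t-1}] = 0, \quad \text{a.s.}$$

So Assumptions 2.1-2.5 in [11] are satisfied. However, we cannot apply the results
there directly due to the term $\eta_t^i$ in the system (5). We can use the same approach
as in [11] to get the following:

$$\sqrt{t} \bar{\Delta}_t = \frac{1}{\sqrt{t} \lambda_0} \alpha_t \Delta_0 - \frac{1}{\sqrt{t}} \sum_{j=1}^{t-1} Q^{-1} (\zeta_j^i + \eta_j^i) - \frac{1}{\sqrt{t}} \sum_{j=1}^{t-1} v_j^t (\zeta_j^i + \eta_j^i), \qquad (11)$$

where $\Delta_t = w_t^i - m^i$ and $\bar{\Delta}_t = \bar{w}_t^i - m^i$, $\alpha_t = \alpha_0^t$, $v_j^t = \alpha_j^t - Q^{-1}$, with $\alpha_j^t =
\lambda_j \sum_{s=j}^{t-1} \prod_{k=j+1}^{s} (I - \lambda_k Q)$, $\alpha_t$ and $v_j^t$ are bounded, and $\lim_{t \to \infty} \frac{1}{t} \sum_{j=1}^{t-1} \|v_j^t\| = 0$.
Noticing that the assumption A3 implies $\eta_t^i = O(t^{-\gamma})$ and

$$\frac{1}{\sqrt{t}} \sum_{j=1}^{t-1} \|\eta_j^i\| \to 0, \quad \text{as } t \to \infty,$$

we can apply the proofs in [11] to (11) and conclude that $\sqrt{t} (\bar{w}_t^i - m^i)$ is
asymptotically normal with zero mean and the covariance matrix $V$, and (9) holds. □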

Theorem 2 allows us to use a small constant learning rate or even the learning rate
$\lambda_t \propto t^{-\alpha}$ with $0 < \alpha < 1$, which tends to zero slower than $t^{-1}$. The assumption
$\beta > 2$ in A4 and A5 can be relaxed to $\beta = 2$ if we only want to obtain (9).
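
The effect of averaging can be illustrated with a small simulation. This is a sketch under assumptions, not the authors' experiment: it reuses a hypothetical 2 x 2 positive definite matrix `Q` and stationary state `m`, replaces the martingale-difference term by independent uniform noise, and uses the slowly decaying learning rate t^(-alpha) with alpha = 0.7. The function name `averaged_sa` is illustrative.

```python
import random


def averaged_sa(Q, m, steps=20000, alpha=0.7, seed=1):
    """Run x_{t+1} = x_t - lam_t (Q (x_t - m) + noise_t) and average the iterates."""
    rng = random.Random(seed)
    dim = len(m)
    x = [0.0] * dim
    running_sum = [0.0] * dim
    for t in range(1, steps + 1):
        # Accumulate x_{t-1}, so the final average is (1/t) * sum_{s=0}^{t-1} x_s.
        running_sum = [running_sum[i] + x[i] for i in range(dim)]
        lam = t ** (-alpha)  # decays more slowly than 1/t
        drift = [sum(Q[i][j] * (x[j] - m[j]) for j in range(dim))
                 for i in range(dim)]
        noise = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        x = [x[i] - lam * (drift[i] + noise[i]) for i in range(dim)]
    average = [s / steps for s in running_sum]
    return x, average


Q = [[2.0, 0.5], [0.5, 1.0]]  # symmetric positive definite (illustrative)
m = [1.0, -2.0]               # stationary state (illustrative)
last_iterate, averaged = averaged_sa(Q, m)
avg_error = max(abs(a - b) for a, b in zip(averaged, m))
```

The averaged iterate is computed alongside the plain recursion at no extra cost per step, which is why averaging is attractive when a slowly decaying or constant learning rate is preferred.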

Note that we showed the convergence of the SOM, but we did not show that the stationary
states are topographic mappings which preserve the topology. Based on the
simulation results in [5, 13] we believe that the SOM converges to a topographic
map, or a micro-structure consisting of several topographic sub-maps.

3  Conclusion 

The  convergence  of  the  learning  algorithm  for  updating  the  SOM  is  studied.  The 
conditions  on  the  neighborhood  function  and  the  learning  rate  have  been  found  to 
guarantee  the  convergence.  The  learning  algorithm  can  be  accelerated  by  averaging 
the  weight  vectors  in  the  training  history.  For  the  learning  algorithm  with  averaging, 
the  convergence  rate  has  been  found. 

Neural networks are statistical models and learning rules are estimators. In this paper a theory
for measuring generalisation is developed by combining Bayesian decision theory with information
geometry. The performance of an estimator is measured by the information divergence between
the true distribution and the estimate, averaged over the Bayesian posterior. This unifies the
majority of error measures currently in use. The optimal estimators also reveal some intricate
interrelationships among information geometry, Banach spaces and sufficient statistics.

1  Introduction 

A neural network (deterministic or stochastic) can be regarded as a parameterised
statistical model $P(y|x, w)$, where $x \in X$ is the input, $y \in Y$ is the output and
$w \in W$ is the weight. In an environment with an input distribution $P(x)$, it is
also equivalent to $P(z|w)$, where $z := [x, y] \in Z := X \times Y$ denotes the combined
input and output as data [11]. Learning is the task of inferring $w$ from $z$. It is
a typical statistical inference problem in which a neural network model acts as a
“likelihood function”, a learning rule as an “estimator”, the trained network as
an “estimate” and the data set as a “sample”. The set of probability measures
on sample space $Z$ forms a (possibly infinite dimensional) differential manifold
$\mathcal{P}$ [2, 16]. A statistical model forms a finite-dimensional submanifold $\mathcal{Q}$, composed
of representable distributions, parameterised by weights $w$ acting as coordinates.
To infer $w$ from $z$ requires additional information about $w$. In a Bayesian framework
such auxiliary information is represented by a prior $P(p)$, where $p$ is the true
but unknown distribution from which $z$ is drawn. This is then combined with the
likelihood function $P(z|p)$ to yield the posterior distribution $P(p|z)$ via the Bayes
formula $P(p|z) = P(z|p) P(p) / P(z)$.

An estimator $\tau : Z \to \mathcal{Q}$ must, for each $z$, fix one $q \in \mathcal{Q}$ which in a sense
approximates $p$.^1 This requires a measure of “divergence” $D(p, q)$ between $p, q \in \mathcal{P}$ defined
independent of parameterisation. General studies on divergences between probability
distributions are provided by the theory of information geometry (see [2, 3, 7]
and further references therein). The main thesis of this paper is that generalisation
error should be measured by the posterior expectation of the information divergence
between true distribution and estimate. We shall show that this retains most
of the mathematical simplicity of mean squared error theory while being generally
applicable to any statistical inference problem.

^1 Some Bayesian methods give the entire posterior $P(p|z)$ instead of a point estimate $q$ as the
answer. They will be shown later to be a special case of the current framework.

2 Measurements of Generalisation

The most natural “information divergence” between two distributions $p, q \in \mathcal{P}$ is
the $\delta$-divergence defined as [2]^2

$$D_\delta(p, q) := \frac{1}{\delta(1-\delta)} \left( 1 - \int p^\delta q^{1-\delta} \right), \qquad \delta \in (0, 1). \qquad (1)$$

The limits as $\delta$ tends to 0 and 1 are taken as definitions of $D_0$ and $D_1$, respectively.
Following are some salient properties of the $\delta$-divergences [2]:

$$D_\delta(p, q) = D_{1-\delta}(q, p) \ge 0, \qquad D_\delta(p, q) = 0 \iff p = q. \qquad (2)$$

$$D_0(q, p) = D_1(p, q) = K(p, q) := \int p \log \frac{p}{q}. \qquad (3)$$

$$D_{1/2}(p, q) = D_{1/2}(q, p) = 2 \int (\sqrt{p} - \sqrt{q})^2. \qquad (4)$$

$$D_\delta(p, p + \Delta p) \approx \frac{1}{2} \int \frac{(\Delta p)^2}{p}. \qquad (5)$$

The quantity $K(p, q)$ is the Kullback-Leibler divergence (cross entropy). The quantity
$D_{1/2}(p, q)$ is the Hellinger distance. The quantity $\int (\Delta p)^2 / p$ is usually called
the $\chi^2$ distance between two nearby distributions.
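
For distributions on a finite sample space the properties above can be checked numerically. The following Python sketch assumes the discrete form of the divergence in (1), namely one minus the sum of p^delta q^(1-delta), divided by delta(1-delta); the function names and the two example distributions are illustrative.

```python
import math


def delta_divergence(p, q, delta):
    """Discrete delta-divergence between probability vectors, for 0 < delta < 1."""
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    overlap = sum(pi ** delta * qi ** (1.0 - delta) for pi, qi in zip(p, q))
    return (1.0 - overlap) / (delta * (1.0 - delta))


def kullback_leibler(p, q):
    """K(p, q) = sum p log(p / q), the limit of the delta-divergence as delta -> 1."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def hellinger_form(p, q):
    """Right-hand side of property (4): 2 * sum (sqrt(p) - sqrt(q))^2."""
    return 2.0 * sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))


p = [0.2, 0.5, 0.3]
q = [0.4, 0.4, 0.2]

duality_gap = abs(delta_divergence(p, q, 0.3) - delta_divergence(q, p, 0.7))
hellinger_gap = abs(delta_divergence(p, q, 0.5) - hellinger_form(p, q))
kl_gap = abs(delta_divergence(p, q, 1.0 - 1e-6) - kullback_leibler(p, q))
```

All three gaps are numerically negligible for this example, matching the duality in (2), the Hellinger form in (4) and the Kullback-Leibler limit in (3).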

Armed with the $\delta$-divergence, we now define the generalisation error

$$E_\delta(\tau) := \int_p P(p) \int_z P(z|p) D_\delta(p, \tau(z)), \qquad E_\delta(q|z) := \int_p P(p|z) D_\delta(p, q), \qquad (6)$$

where $p$ is the true distribution, $\tau$ is the learning rule, $z$ is the data, and $q =
\tau(z)$ is the estimate. A learning rule $\tau$ is called $\delta$-optimal if it minimises $E_\delta(\tau)$.
A probability distribution $q$ is called a $\delta$-optimal estimate, or simply a $\delta$-estimate,
from data $z$, if it minimises $E_\delta(q|z)$. The following theorem is a special case of a
standard result from Bayesian decision theory.

Theorem 1 (Coherence)  A learning rule $\tau$ is $\delta$-optimal if and only if for any
data $z$, excluding a set of zero probability, the result of training $q = \tau(z)$ is a
$\delta$-estimate.

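Definition 2 ($\delta$-coordinate)  Let $\mu := 1/\delta$, $\nu := 1/(1 - \delta)$. Let $L_\mu$ be the Banach
space of $\mu$th power integrable functions. Then $L_\mu$ and $L_\nu$ are dual to each other as
Banach spaces. Let $p \in \mathcal{P}$. Its $\delta$-coordinate is defined as $l_\delta(p) := p^\delta / \delta \in L_\mu$ for
$\delta > 0$, and $l_0(p) := \log p$ [2]. Denote by $l_\delta^{-1}$ the inverse of $l_\delta$.

Theorem 3 ($\delta$-estimator in $\mathcal{P}$)  The $\delta$-estimate $q \in \mathcal{P}$ is uniquely given [If] by
$q \sim l_\delta^{-1} \left( \int_p P(p|z) l_\delta(p) \right)$.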

3  Divergence  between  Finite  Positive  Measures 

One of the most useful properties of the least mean square estimate is the so called
MSE = VAR + BIAS$^2$ relation, which also implies that, for a given linear space
$W$, the LMS estimate of $w$ within $W$ is given by the projection of the posterior
mean $\hat{w}$ onto $W$. This is generalised to the following theorem [16], applying the
generalised Pythagorean Theorem for $\delta$-divergences [2].

^2 This is essentially Amari's $\alpha$-divergence, where $\alpha \in [-1, 1]$, re-parameterised by $\delta = (1 -
\alpha)/2 \in [0, 1]$ for technical convenience, following [6].

Theorem 4 (Error decomposition in $\mathcal{Q}$)  Let $\mathcal{Q}$ be a $\delta$-flat manifold. Let $P(p)$
be a prior on $\mathcal{Q}$. Then $\forall q \in \mathcal{Q}$, $\forall z \in Z$,

$$E_\delta(q|z) = E_\delta(\hat{p}|z) + D_\delta(\hat{p}, q), \qquad (7)$$

where $\hat{p}$ is the $\delta$-estimate in $\mathcal{Q}$.

To apply this theorem it is necessary to extend the definition of $\delta$-divergence to $\tilde{\mathcal{P}}$,
the space of finite positive measures, which is $\delta$-flat for any $\delta$ for a finite sample
space $Z$ [2], following suggestions in [2].

Definition 5 ($\delta$-divergence on $\tilde{\mathcal{P}}$)  The $\delta$-divergence on $\tilde{\mathcal{P}}$ is defined by

$$D_\delta(p, q) := \frac{1}{\delta(1-\delta)} \int \left( \delta p + (1-\delta) q - p^\delta q^{1-\delta} \right). \qquad (8)$$

This definition retains most of the important properties of $\delta$-divergence on $\mathcal{P}$, and
reduces to the original definition when restricted to $\mathcal{P}$. It has the additional
advantage of being the integral of a positive measure, making it possible to attribute
the divergence between two measures to their divergence over various events [16].
In particular, the generalised cross entropy is [16]

$$K(p, q) := \int \left( q - p + p \log \frac{p}{q} \right). \qquad (9)$$

The $\delta$-divergence defines a differential structure on $\tilde{\mathcal{P}}$. The Riemannian geometry
and the $\delta$-affine connections can be obtained by the Eguchi relations [2, 7]. The most
important advantage of this definition is that the following important theorem is
true and can be proved by pure algebraic manipulation [16].

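Theorem 6 (Error decomposition on $\tilde{\mathcal{P}}$)  Let $P(p)$ be a distribution over $\tilde{\mathcal{P}}$.
Let $q \in \tilde{\mathcal{P}}$. Then

$$\langle D_\delta(p, q) \rangle = \langle D_\delta(p, \hat{p}) \rangle + D_\delta(\hat{p}, q), \qquad (10)$$

where $\hat{p}$ is the $\delta$-average of $p$ given by $\hat{p}^\delta := \langle p^\delta \rangle$.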
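
The error decomposition can be verified numerically on a finite sample space. The sketch below is an illustration under assumptions: it takes the divergence on finite positive measures to be the sum of delta*p + (1-delta)*q - p^delta q^(1-delta), divided by delta(1-delta), and represents the distribution over measures by a few hypothetical weighted measures. All names and numbers are illustrative.

```python
def measure_divergence(p, q, delta):
    """Assumed delta-divergence between finite positive measures (0 < delta < 1)."""
    total = sum(delta * pi + (1.0 - delta) * qi - pi ** delta * qi ** (1.0 - delta)
                for pi, qi in zip(p, q))
    return total / (delta * (1.0 - delta))


def delta_average(measures, weights, delta):
    """Pointwise delta-average: p_hat^delta equals the weighted mean of p^delta."""
    size = len(measures[0])
    return [sum(w * p[i] ** delta for p, w in zip(measures, weights)) ** (1.0 / delta)
            for i in range(size)]


def decomposition_sides(measures, weights, q, delta):
    """Return both sides of the decomposition for a discrete mixture of measures."""
    p_hat = delta_average(measures, weights, delta)
    lhs = sum(w * measure_divergence(p, q, delta) for p, w in zip(measures, weights))
    rhs = (sum(w * measure_divergence(p, p_hat, delta)
               for p, w in zip(measures, weights))
           + measure_divergence(p_hat, q, delta))
    return lhs, rhs


measures = [[0.2, 0.5, 0.3], [1.0, 0.1, 0.4], [0.6, 0.6, 0.6]]  # not normalised
weights = [0.5, 0.3, 0.2]                                       # sums to one
q = [0.3, 0.9, 0.2]
lhs, rhs = decomposition_sides(measures, weights, q, delta=0.4)
```

The two sides agree to rounding error, and the identity holds for any positive `q`, which is why the delta-average minimises the expected divergence.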

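Theorem 7 ($\delta$-estimator in $\tilde{\mathcal{P}}$)  The $\delta$-estimate $\hat{p} = \tau_\delta(z)$ in $\tilde{\mathcal{P}}$ is given by $\hat{p}^\delta =
\langle p^\delta \rangle_z$. In particular, the 1-estimate is the posterior marginal distribution $\hat{p} = \langle p \rangle_z$.

Theorem 8 ($\delta$-estimator in $\mathcal{Q}$)  Let $\mathcal{Q}$ be an arbitrary submanifold of $\tilde{\mathcal{P}}$. The
$\delta$-estimate $\hat{q}$ in $\mathcal{Q}$ is given by the $\delta$-projection of $\hat{p}$ onto $\mathcal{Q}$, where $\hat{p}$ is the $\delta$-estimate
in $\tilde{\mathcal{P}}$.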

4  Examples  and  Applications  to  Neural  Networks 

Explicit  formulas  are  derived  for  the  optimal  estimators  for  the  multinomial  [15] 
and  normal  distributions  [14]. 

Example 1  Let $m \in \mathbb{N}^n$, $p \in \mathcal{P} = \Delta^{n-1}$, $a \in \mathbb{R}_+^n$. Consider the multinomial
family of distributions $M(m|p)$ with a Dirichlet prior $D(p|a)$. The posterior is also
a Dirichlet distribution $D(p|a + m)$. The $\delta$-estimate $\hat{p} \in \tilde{\mathcal{P}}$ is given by $(\hat{p}_i)^\delta =
(a_i + m_i)_\delta / (|a + m|)_\delta$, where $|a| := \sum_i a_i$ and $(a)_\delta := \Gamma(a + \delta) / \Gamma(a)$. In particular,
$\hat{p}_i = (a_i + m_i) / |a + m|$ for $\delta = 1$, and $\hat{p}_i = \exp(\Psi(a_i + m_i) - \Psi(|a + m|))$ for $\delta = 0$,
where $\Psi$ is the digamma function. The $\delta$-estimate $q \in \mathcal{P}$ is given by normalising
$\hat{p}$.
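
A small computation makes the multinomial example concrete. The sketch below assumes the formula reads: the delta-th power of each component of the estimate equals the ratio of Gamma(a_i + m_i + delta)/Gamma(a_i + m_i) to Gamma(|a + m| + delta)/Gamma(|a + m|). It covers 0 < delta <= 1 only (the digamma case delta = 0 is omitted), and the function names, prior and counts are illustrative.

```python
import math


def rising_ratio(a, delta):
    """(a)_delta := Gamma(a + delta) / Gamma(a), computed through log-gamma."""
    return math.exp(math.lgamma(a + delta) - math.lgamma(a))


def multinomial_delta_estimate(prior, counts, delta):
    """Return the unnormalised delta-estimate and its normalised version."""
    if not 0.0 < delta <= 1.0:
        raise ValueError("this sketch handles 0 < delta <= 1 only")
    posterior = [a + m for a, m in zip(prior, counts)]  # Dirichlet posterior parameters
    total = sum(posterior)
    p_hat = [(rising_ratio(a, delta) / rising_ratio(total, delta)) ** (1.0 / delta)
             for a in posterior]
    norm = sum(p_hat)
    q = [x / norm for x in p_hat]
    return p_hat, q


prior = [1.0, 1.0, 1.0]   # uniform Dirichlet prior (illustrative)
counts = [3, 0, 1]        # observed counts (illustrative)
p_hat_one, q_one = multinomial_delta_estimate(prior, counts, 1.0)
p_hat_half, q_half = multinomial_delta_estimate(prior, counts, 0.5)
```

For delta = 1 the estimate is the posterior mean and already sums to one. For delta < 1 the unnormalised estimate has total mass below one, so the normalisation step changes it.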

Example 2  Let $z, \mu \in \mathbb{R}$, $h \in \mathbb{R}_+$, $a \in \mathbb{R}$, $n \in \mathbb{R}_+$. Consider the Gaussian
family of distributions $f(z|\mu) = N(z - \mu | h)$ with fixed variance $\sigma^2 = 1/h$. Let the
prior be another Gaussian $f(\mu) = N(\mu - a | nh)$. Then the posterior after seeing
a sample $z$ of size $k$ is also a Gaussian $f(\mu|z) = N(\mu - a_k | n_k h)$, where $n_k =
n + k$, $a_k = (na + \sum z) / n_k$, which is also the posterior least squares estimate. The
$\delta$-estimate $q \in \mathcal{P}$ is given by the density $f(z'|q) = N(z' - a_k | h / (1 + \delta / n_k))$.

The  entities  |a|  for  the  multinomial  model  and  n  for  the  Gaussian  model  are  effective 
previous  sample  sizes,  a  fact  known  since  Fisher’s  time.  In  a  restricted  model,  the 
sample  size  might  not  be  well  reflected,  and  some  ancillary  statistics  may  be  used 
for  information  recovery  [2] . 

Example 3  In some Bayesian methods, such as the Monte Carlo method [10], no
estimator is explicitly given. Instead, the posterior is directly used for sampling $p$.
This  produces  a  prediction  distribution  on  test  data  which  is  the  posterior  marginal 
distribution.  Therefore  these  methods  are  implicitly  1-estimators. 

Example 4  Multilayer neural networks are usually not $\delta$-convex for any $\delta$, and
there may exist local optima of $E_\delta(\cdot|z)$ on $\mathcal{Q}$. A practical learning rule is usually
a gradient descent rule which moves $w$ in the direction which reduces $E_\delta(q|z)$.
The  1-divergence  can  be  minimised  by  a  supervised  learning  rule,  the  Boltzmann 
machine  learning  rule  [1].  The  0-divergence  can  be  minimised  by  a  reinforcement 
learning  rule,  the  simulated  annealing  reinforcement  learning  rule  for  stochastic 
networksflSj. 

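$$\min_q K(p, q) \implies \Delta w \sim \langle \partial_w l_0(q) \rangle_p - \langle \partial_w l_0(q) \rangle_q \qquad (11)$$

$$\min_q K(q, p) \implies \Delta w \sim \langle \partial_w l_0(q), \; l_0(p) - l_0(q) \rangle_q \qquad (12)$$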

5  Conclusions 

The problem of finding a measurement of generalisation is solved in the framework
of Bayesian decision theory, with machinery developed in the theory of information
geometry.

Working in the Bayesian framework ensures that the measurement is internally
coherent, in the sense that a learning rule is optimal if and only if it produces
optimal estimates for almost all the data. Adopting an information geometric
measurement of divergence between distributions ensures that the theory is
independent of parameterisation. This resolves the controversy in [8, 12, 9].

To guarantee a unique and well-defined solution to the learning problem, it is
necessary to generalise the concept of information divergence to the space of finite
positive measures. This development reveals certain elegant relations between
information geometry and the theory of Banach spaces, showing that the dually-affine
geometries of statistical manifolds are in fact intricately related to the dual linear
geometries of Banach spaces.

In a computational model, such as a classical statistical model or a neural network,
the optimal estimator is the projection of the ideal estimator to the model. This
theory generalises the theory of linear Gaussian regression to general statistical
estimation and function approximation problems. Further research may lead to
Kalman filter type learning rules which are not restricted to linear and Gaussian
models.

This paper marks a step towards an algebraic theory of neural networks. In it, a new notion of
sequential composition for neural networks is defined, and some observations are made on the
algebraic structure of the class of networks under this operation. The composition is shown to
reflect a notion of composition of state spaces of the networks. The paper ends on a very brief
exposition of the usefulness of similar composition theories for other models of computation.

1  What  is  an  Algebraic  Study? 

The  view  of  mathematics  that  underlies  this  work  is  one  forged  by  Category  Theory 
(see  for  example  [1]).  In  Category  Theory  scrutiny  is  focused  less  on  objects  than  on 
mappings between them. Thus the natural question is: “What are the mappings?”
In  this  kind  of  study  the  answer  tends  to  be:  “Functions  (or  sometimes  partial 
functions)  that  preserve  the  relevant  operations.”  So  before  we  can  start  looking 
at  the  category  of  neural  networks  we  need  to  answer  the  question:  “What  are  the 
relevant  operations?” 

For example, recall that a group is a set, $G$, with a binary operation ($*$), a constant
($e$), and a unary operation $(\ )^{-1}$. To be a group, it is also necessary that these
operations satisfy some equations to the effect that $*$ is associative, $e$ is a two-sided
identity for $*$ and $(\ )^{-1}$ forms inverses, but these are the defining group operations.
A group homomorphism is a function, $\phi$, from one group to another satisfying

$$\phi(a * b) = \phi(a) * \phi(b), \quad \phi(e) = e \quad \text{and} \quad \phi(a^{-1}) = (\phi(a))^{-1}.$$

That is, a group homomorphism is defined to be a function that preserves all of the
defining operations of a group.
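
The definition can be checked mechanically for small finite groups. The following Python sketch is an illustration only: it uses the cyclic groups of integers modulo 6 and modulo 3 as a hypothetical example, and the helper names `cyclic_group` and `is_homomorphism` are not from the paper.

```python
def cyclic_group(n):
    """Return (elements, product, identity, inverse) for the integers modulo n."""
    elements = list(range(n))
    return elements, (lambda a, b: (a + b) % n), 0, (lambda a: (-a) % n)


def is_homomorphism(phi, source, target):
    """Check that phi preserves the product, the identity and inverses."""
    elements, op_g, e_g, inv_g = source
    _, op_h, e_h, inv_h = target
    preserves_product = all(phi(op_g(a, b)) == op_h(phi(a), phi(b))
                            for a in elements for b in elements)
    preserves_identity = phi(e_g) == e_h
    preserves_inverse = all(phi(inv_g(a)) == inv_h(phi(a)) for a in elements)
    return preserves_product and preserves_identity and preserves_inverse


z6 = cyclic_group(6)
z3 = cyclic_group(3)


def reduce_mod_3(a):
    """Reduction modulo 3: preserves all three defining operations."""
    return a % 3


def shifted(a):
    """Shifted reduction: fails already on the identity."""
    return (a + 1) % 3


good = is_homomorphism(reduce_mod_3, z6, z3)
bad = is_homomorphism(shifted, z6, z3)
```

Here `good` is true and `bad` is false, since the shifted map does not send the identity to the identity.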

2  What  are  the  Defining  Operations  of  Neural  Nets? 

A synchronous neural net is a set of nodes, each of which computes a function;
the input for the function at a node at a given time is the output of some nodes
at the previous time step (the set of nodes that supply the input to a given node is
fixed for all time and determined by the interconnection structure of the net). This
structure is captured diagrammatically by drawing a circle for each node and an
arrow from node N to node M if M acts as an input for N. For example, the
following is a picture of a two-noded net each of whose nodes takes an input from
node $f$, and $g$ takes another input from itself:


Figure  1 

The  letters  indicate  that  node  on  the  left  computes  the  function  f{n)  and  the  node 
on  the  right  computes  function  g(n,m).  So  a  neural  net  is  a  finite  directed  graph 
each of whose nodes is labelled by a function. We will call the underlying unlabelled
graph the shape of the net.

As well as the shape (the interconnection structure), there is a recursive computational
structure generated by the network. In this example we get the two recursive
functions:

$$f_{n+1} = f(f_n),$$

$$g_{n+1} = g(f_n, g_n),$$

where, for example, $f_i$ denotes the state of the node labelled by $f$ at time $i$. Once
$f$ and $g$ are known, these equations entirely determine the dynamics of the system:
that is, the sequence of values of the pair $\langle f_i, g_i \rangle$.

3  The  State  Space  of  a  Boolean  Net 

In this section we will restrict ourselves to nets whose nodes compute boolean
functions. A state of such a net is a vector of 0's and 1's, representing the states
of the nodes of the net. When we say that a net is in state vect, we mean that
for all $i$, the $i$-th node of the net has most recently computed the $i$-th value of vect.
Since the next state behaviour of a net is completely determined by the recursion
equations, we can completely compute a graph of the possible states of the net. In
the example above, if $f$ is invert and $g$ is and, then the state graph is given below:


Figure  2 
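
Since the figure itself is not reproduced here, the state graph can be recomputed directly from the recursion equations. The sketch below is an illustration: it encodes a state as the pair (f, g) of node values, takes f to be invert and g to be and, and the names `state_graph` and `first_net` are hypothetical.

```python
from itertools import product


def state_graph(step, n_nodes):
    """Map every boolean state vector to its successor under the net's one-step rule."""
    return {state: step(state) for state in product((0, 1), repeat=n_nodes)}


def first_net(state):
    """One step of the example net: f <- invert(f), g <- and(f, g)."""
    f, g = state
    return (1 - f, f & g)


graph = state_graph(first_net, 2)
for state, successor in sorted(graph.items()):
    print(state, "->", successor)
```

Each state has exactly one outgoing arc, because the net is deterministic, so the state graph is the graph of a function on the four states.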

This  exemplifies  a  mapping  from  boolean  neural  nets  to  state  spaces  and  points 
to  an  algebraic  relationship  whose  desirability  can  guide  us  in  our  setting  out  of 
our  algebraic  theory:  we  should  ensure  that  the  space  of  boolean  neural  networks 
is  fibred  over  the  space  of  boolean  state  spaces.  For  a  detailed  discussion  of  fibred 
categories  see  [2],  for  now  it  will  suffice  to  give  an  informal  definition. 

Recall that a category is simply a collection of objects and maps between them.
In essence, what it means for a category, E, to be fibred over a category, C,
is that there is a projection from E to C (the part of E that gets projected to a
particular C-object $C$ constitutes the fibre over $C$) such that: for every C-mapping
$f : C \to C'$ and E-object, $E$, in the fibre over $C$, there is a unique lifting of $f$ to a
map starting at $E$ and projecting onto $f$. The idea is that what happens in E gets
reflected down to C, and, if we are lucky, what happens in C is liftable to E. This
can hold for maps and operations. We want to ensure that the category of Neural
Networks is in this relationship to the category of state spaces, and that the state
space composition is liftable.

4 Composition in the State Space

We  wish  to  define  a  notion  of  composition  of  nets  that  is  a  lifting  of  state  space 
composition.  The  sequential  composition  of  two  graphs  with  the  same  nodes  is  quite 
well-known:  an  arc  in  the  composite  graph  is  an  arc  in  the  first  graph  followed  by  an 
arc  in  the  second.  Two  nets  with  the  same  nodes  automatically  yield  state  graphs 
with  the  same  nodes.  Our  first  lifting  of  state-space  composition  will  take  advantage 
of  this  fact  and  only  be  defined  for  networks  on  the  same  (perhaps  after  renaming) 
nodes.  We  shall  see  that  this  lifting  is  all  we  need  to  explain  the  closure  properties 
of  state  spaces  under  composition. 

To continue with our example, consider the net given by the equations

$$f_{n+1} = f_n * g_n,$$

$$g_{n+1} = \neg f_n.$$

Notice that the shape of this net differs from the other. The shape is shown by:


Figure  3 


The  state  space  and  the  result  of  composing  it  with  the  graph  above  are  pictured 
below: 


Figure  4 


The  observation  that  initiated  this  work  is  that  the  composite  state  space  (and 
all  others  that  can  be  similarly  derived)  is  the  state  space  of  a  neural  network. 
That  is,  if  SS  denotes  the  function  that  sends  a  neural  net  to  its  state  space,  then 
for  any  two  boolean  nets,  N  and  M,  the  composition  SS{N)  *  SS{M)  turns  out 
to be $SS(\textit{something})$. We will now describe an operation on the nets such that
$SS(N) * SS(M) = SS(N * M)$.

5  Simple  Composition  of  Nets 

The first form of neural network composition is direct functional composition on
the recursion equations. The concept is most readily understood in terms of an
example. Thus, to continue the example above:

First Net:  $f_{n+1} = \neg f_n$  and  $g_{n+1} = f_n * g_n$

Second Net:  $f_{n+1} = f_n * g_n$  and  $g_{n+1} = \neg f_n$

The composition consists of the second net using the results of the first net:

Composite Net:  $f_{n+2} = f_{n+1} * g_{n+1} = \neg f_n * f_n * g_n = 0$

                $g_{n+2} = \neg f_{n+1} = \neg\neg f_n = f_n$

Note that the composite net works with $n + 2$ as one step, since its one step is a
sequence of one step in each of two nets. With that rewriting in mind ($n + 1$ for
$n + 2$), the equations for the composite net yield the composite state space above.
And it is not hard to see that this will hold for any two boolean nets with the same
set of nodes.
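
The claim that composing nets corresponds to composing state graphs can be checked by brute force for the running example. In this sketch, which is only an illustration, a net is represented by its one-step function on states (f, g), the symbol * is read as boolean and, and the helper names are hypothetical.

```python
from itertools import product


def state_graph(step, n_nodes=2):
    """Map every boolean state vector to its successor under the net's one-step rule."""
    return {state: step(state) for state in product((0, 1), repeat=n_nodes)}


def first_net(state):
    """f <- not f, g <- f and g."""
    f, g = state
    return (1 - f, f & g)


def second_net(state):
    """f <- f and g, g <- not f."""
    f, g = state
    return (f & g, 1 - f)


def compose_nets(first, second):
    """Composite net: one step of the first net followed by one step of the second."""
    return lambda state: second(first(state))


def compose_graphs(first_graph, second_graph):
    """Sequential composition of state graphs: follow an arc in each graph in turn."""
    return {state: second_graph[first_graph[state]] for state in first_graph}


composite_net_graph = state_graph(compose_nets(first_net, second_net))
composite_of_graphs = compose_graphs(state_graph(first_net), state_graph(second_net))
```

Both constructions give the same graph, in which every state (f, g) moves to (0, f), matching the composite equations derived above. The identity function on states acts as the identity for this composition, consistent with the monoid structure.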

The  set  of  nets  with  the  same  nodes  is  a  monoid  under  this  operation:  the  compo¬ 
sition  is  associative  and  has  an  identity. 

6  Two  More  Complex  Notions  of  Sequential  Composition 

More  interesting  notions  of  composition  allow  us  to  compose  nets  with  different 
nodes. The first of these compositions requires a function φ from the nodes of the
second net to the nodes of the first. The composition is then very like the simple
one except that instead of simply using the same names to say, for example,

f_{n+2} = f_{n+1} * g_{n+1} = f_n * f_n * g_n

there is an extra dimension: we say

f_{n+2} = f_{n+1} * g_{n+1}

in the second net and translate the right-hand side to

φ(f)_{n+1} * φ(g)_{n+1}

in the first net.

This is obviously more general (the simple version is just the special case in which
φ is the identity) and we believe it will lead to rich composition theories.
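
One possible reading of this construction is sketched below. It rests on several assumptions that are not taken from the text: states are dictionaries from node names to bits, the node function (called phi here) is a dictionary from nodes of the second net to nodes of the first, and "translating" means that each node b of the second net reads the value that node phi(b) holds after one step of the first net. The node names p and q, the function names, and the use of XOR for * are all hypothetical.

```python
def compose_via(first, second, phi):
    """Compose `first` (nodes A) with `second` (nodes B) along phi: B -> A."""
    def composite(state):
        after_first = first(state)                 # step n -> n+1 in the first net
        translated = {b: after_first[a]            # node b of the second net
                      for b, a in phi.items()}     # reads node phi(b) of the first
        return second(translated)                  # step n+1 -> n+2 in the second net
    return composite

def first(s):     # nodes f, g:  f' = f,      g' = f * g   (XOR assumed for *)
    return {"f": s["f"], "g": s["f"] ^ s["g"]}

def second(s):    # nodes p, q:  p' = p * q,  q' = p       (XOR assumed for *)
    return {"p": s["p"] ^ s["q"], "q": s["p"]}

phi = {"p": "f", "q": "g"}    # hypothetical node map from {p, q} to {f, g}
net = compose_via(first, second, phi)
print(net({"f": 1, "g": 0}))
```

When both nets share the same node names and phi sends every node to itself, `compose_via` reduces to applying the first net and then the second, which matches the remark that the simple composition is the special case of the identity map.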

An even more general composition comes from taking the union of these
compositions for all functions φ. This composition is related to the concept of wreath
product for groups and automata (see [3]).

7  What  use  is  a  Composition  Theory? 

A  proper  composition  theory  can  be  used  to  explain  how  complex  things  can  and 
do  get  built  from  simple  ones.  The  theory  can  be  used  to  break  things  down  as  well. 
For  example,  there  could  well  be  a  set  of  primitive  objects  and  a  theorem  that  all 
objects  can  be  built  from  them.  A  famous  instance  of  such  a  theorem  was  given  by 
Krohn  and  Rhodes: 

Theorem [Krohn-Rhodes 4] Given a finite monoid M, M divides a wreath
product of simple groups and switches, each of which divides M.

And  an  instance  that  we’ve  been  using  is  given  by: 

Theorem [5] Given a finite concrete category C, C divides a wreath product of
primitive minimal categories, each of which divides C.

This paper marks only the beginning of a composition theory of neural nets. The
Krohn-Rhodes  theorem  has  fostered  a  whole  branch  of  algebra  that  has  shed  much 
light  on  automata  and  semigroups,  while  the  other  theorem  has  found  application  in 
various  fields  of  computing.  We  would  like  our  neural  net  composition  to  be  useful 
in  both  of  these  fashions.  The  mathematics  will  consist  of  a  study  of  the  category 
of Networks as a fibred category and some projects that naturally arise are: uncover
a  set  of  primitives,  study  the  ordering  determined  by  the  decomposition,  devise  a 
homomorphism  theorem,  and  extend  the  work  to  other  operations,  such  as  parallel 
composition.  And  to  give  some  idea  of  the  kinds  of  application  we  envisage,  here 
are  two  applications  of  the  categorical  composition  theory: 

Designing  Chips:  We  take  an  algebraic  description  of  a  chip  design  and  use  the 
decomposition  to  implement  it  using  pre-defined  primitive  modules  (see  [6]). 
Problem  Solving:  We  turn  difficult  problems  into  sequences  of  problems  that  are 
more  easily  solved  (see  [7]). 

We  hope  to  find  applications  that  are  akin  to  these.  The  problem  solving  work 
automatically  makes  hierarchies  of  search  spaces.  Could  we  form  similar  neural 
network  hierarchies?  Moreover,  if  one  wanted  to  design  a  net  with  a  particular 
behaviour  this  might  be  a  modular  way  to  do  it.  And  to  finish  with  an  interesting 
entirely  open  question:  how  can  we  apply  a  theory  like  this  to  nets  as  they  are 
trained? 